0. Abstract

1. Introduction

The website Kaggle.com, an online community of data scientists, offers many clean, formatted data sets on which analysis can be performed. For this project, I used the Kaggle data set “House Sales in King County, USA”, which includes all home sales from May 2014 to May 2015. The city of Seattle, Washington, USA lies in King County. Seattle is known for having some of the most expensive and lavish homes in the United States. The data set covers a wide variety of homes, ranging from small homes to massive mansions containing over 30 rooms.

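library(dplyr)    # %>%, select(), case_when()
library(ggplot2)  # ggplot()
library(GGally)   # ggpairs()
library(ggpubr)   # ggarrange(); assumed source package, not stated in the original

house <- read.csv('kc_house_data.csv')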
head(house)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930
#p1 <- get_googlemap("king county") %>% ggmap
#p1 + geom_point(data = house, aes(x = long, y = lat), alpha = 0.03, colour = "red")
#ggsave("map.png")
knitr::include_graphics("map.png")

2. Settings

Below are all the columns in this dataset:

Because id, date, lat, long, and zipcode are not needed for this analysis, let’s drop them.

house <- house %>% dplyr::select(-id, -date, -lat, -long, -zipcode)
head(house)
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1  221900        3      1.00        1180     5650      1          0    0
## 2  538000        3      2.25        2570     7242      2          0    0
## 3  180000        2      1.00         770    10000      1          0    0
## 4  604000        4      3.00        1960     5000      1          0    0
## 5  510000        3      2.00        1680     8080      1          0    0
## 6 1225000        4      4.50        5420   101930      1          0    0
##   condition grade sqft_above sqft_basement yr_built yr_renovated sqft_living15
## 1         3     7       1180             0     1955            0          1340
## 2         3     7       2170           400     1951         1991          1690
## 3         3     6        770             0     1933            0          2720
## 4         5     7       1050           910     1965            0          1360
## 5         3     8       1680             0     1987            0          1800
## 6         3    11       3890          1530     2001            0          4760
##   sqft_lot15
## 1       5650
## 2       7639
## 3       8062
## 4       5000
## 5       7503
## 6     101930

The dataset contains 21,613 observations and 16 columns, with no missing values.

nrow(house) 
## [1] 21613
ncol(house)
## [1] 16
any(is.na(house)) ## TRUE if any cell is missing; is.null() only tests whether the object itself is NULL
## [1] FALSE
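
Note that is.null() reports whether an object is NULL, not whether it contains missing values. A minimal sketch of the difference, using a made-up data frame `toy` (hypothetical, not part of the housing data):

```r
# Hypothetical toy data frame with two missing cells
toy <- data.frame(price = c(100, NA, 300), rooms = c(2, 3, NA))

null_check <- is.null(toy)         # FALSE: the object exists, NAs are not inspected
na_check   <- any(is.na(toy))      # TRUE: at least one cell is NA
na_per_col <- colSums(is.na(toy))  # number of NAs in each column

print(null_check)
print(na_check)
print(na_per_col)
```

A per-column count such as colSums(is.na(house)) would show where any missing values sit, if there were any.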

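Let’s reorder the columns to make the dataset more readable. Note that sqft_above and sqft_basement are not listed in col_order, so this step also drops them and leaves 14 columns.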

col_order <- c("price", "bedrooms", "bathrooms", "floors", "waterfront", "view", "condition", "grade", "yr_built",
               "yr_renovated", "sqft_living", "sqft_lot", "sqft_living15", "sqft_lot15")
house <- house[, col_order]

Because the bathrooms column contains fractional values (for example, 2.25), let’s round it to the nearest whole number.
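
One detail worth knowing about R's round(): it rounds to the nearest integer and sends exact halves to the even neighbour, so it does not always round up. A small sketch on made-up values (not taken from the dataset) compares it with ceiling(), which would be the function to use if rounding up were really intended:

```r
# Made-up bathroom counts, including exact halves
x <- c(0.5, 1.5, 2.25, 2.5, 3.5)

nearest <- round(x)    # nearest integer, halves go to the even neighbour
upward  <- ceiling(x)  # always rounds up

print(data.frame(x = x, round = nearest, ceiling = upward))
```

So a house listed with 2.5 bathrooms becomes 2 under round(), while one with 3.5 becomes 4.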

The yr_built and yr_renovated columns record the year a house was built or renovated, so let’s convert them into categorical variables. For yr_built, we can group the years into six categories (coded 0 to 5): five 20-year intervals covering 1900 to 1999, plus one for houses built in 2000 or later. For yr_renovated, we create a renovated indicator that is 1 if the house was renovated and 0 otherwise.

house$bathrooms <- round(house$bathrooms)

house$yr_built <- case_when(
  (1900 <= house$yr_built) &  (house$yr_built< 1920) ~ 0,
  (1920 <= house$yr_built) &  (house$yr_built< 1940) ~ 1,
  (1940 <= house$yr_built) &  (house$yr_built< 1960) ~ 2,
  (1960 <= house$yr_built) &  (house$yr_built< 1980) ~ 3,
  (1980 <= house$yr_built) &  (house$yr_built< 2000) ~ 4,
  (2000 <= house$yr_built) ~ 5)

house$renovated <- ifelse(house$yr_renovated != 0, 1, 0)

house <- house %>% dplyr::select(-yr_renovated)
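
The same 20-year binning could also be written with base R's cut(). The sketch below is an alternative, not the code used in this report, and it runs on a made-up vector `yr` rather than the real column. As with the case_when() version, any year before 1900 would come out as NA.

```r
# Made-up construction years covering every bin edge
yr <- c(1900, 1919, 1920, 1955, 1979, 1999, 2000, 2015)

# Left-closed intervals [1900, 1920), [1920, 1940), ..., [2000, Inf)
breaks <- c(1900, 1920, 1940, 1960, 1980, 2000, Inf)

# labels = FALSE returns the bin index 1..6; subtract 1 to get codes 0..5
bin <- cut(yr, breaks = breaks, right = FALSE, labels = FALSE) - 1

print(data.frame(yr = yr, bin = bin))
```

Keeping the break points in one vector makes it easy to change the interval width later without rewriting each condition.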

Before diving into EDA, let’s split the dataset into train data and test data.

set.seed(1) ##for reproducibility to get the same split
sample<-sample.int(nrow(house), floor(.80*nrow(house)), replace = F)
train<-house[sample, ] ##training data frame
test<-house[-sample, ] ##test data frame
head(train)
##        price bedrooms bathrooms floors waterfront view condition grade yr_built
## 17401 550000        3         2    1.5          0    0         3     8        3
## 4775  275000        4         2    2.0          0    0         3     7        4
## 13218 455000        5         2    2.0          0    0         3     6        4
## 10539 384950        3         2    2.0          0    0         3     7        5
## 8462  140000        2         1    1.0          0    0         2     6        2
## 4050  925000        3         2    2.0          0    0         5     7        2
##       sqft_living sqft_lot sqft_living15 sqft_lot15 renovated
## 17401        2910    35200          2590      37500         0
## 4775         2120     6754          2120       6937         0
## 13218        1510     3000          1610       3600         0
## 10539        1860     3690          1870       4394         0
## 8462          900     6400          1350       6405         0
## 4050         2690     7000          1800       6435         0
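
With 21,613 rows, floor(.80 * 21613) puts 17,290 rows in the training set and leaves 4,323 for the test set. A quick way to sanity-check a split like this is sketched below on a made-up data frame `toy` (the `id` column is hypothetical and only there to test for overlap):

```r
set.seed(1)  # same idea as above: fix the seed so the split is reproducible

# Made-up data frame standing in for the housing data
toy <- data.frame(id = 1:50, y = rnorm(50))

idx       <- sample.int(nrow(toy), floor(0.80 * nrow(toy)), replace = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# The two parts should cover every row exactly once
sizes_ok    <- nrow(toy_train) + nrow(toy_test) == nrow(toy)
disjoint_ok <- length(intersect(toy_train$id, toy_test$id)) == 0

print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(c(sizes_ok = sizes_ok, disjoint_ok = disjoint_ok))
```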

3. Exploratory Data Analysis

Using ggpairs, we can check the overall relationship between each column and the response variable, price, as well as the distribution of each column.

First, among the physical attributes of houses (bedrooms, bathrooms, and floors), bathrooms had a fairly strong correlation with price. The correlations for bedrooms and floors were somewhat weaker.

house_1 <- train %>% dplyr::select(price, bedrooms, bathrooms, floors)
ggpairs(house_1)

Looking at the boxplots for each column and category, price was not necessarily proportional to the number of bedrooms or the number of floors. In short, the most expensive house did not have the largest number of bedrooms or floors. However, price tends to increase as the number of bathrooms increases. In our dataset, the most expensive house had the largest number of bathrooms. This is why, among these three columns, bathrooms had the highest correlation with price.

p1 <- ggplot(train, aes(x = as.factor(bedrooms), y = price, fill = as.factor(bedrooms))) +
  geom_boxplot() +
  labs(x = "Number of Bedrooms", y = "Price", title = "Price by Number of Bedrooms", fill = "Bedrooms")

p2 <- ggplot(train, aes(x = as.factor(bathrooms), y = price, fill = as.factor(bathrooms))) +
  geom_boxplot() +
  labs(x = "Number of Bathrooms", y = "Price", title = "Price by Number of Bathrooms", fill = "Bathrooms")

p3 <- ggplot(train, aes(x = as.factor(floors), y = price, fill = as.factor(floors))) +
  geom_boxplot() +
  labs(x = "Number of Floors", y = "Price", title = "Price by Number of Floors", fill = "Floors")

ggarrange(p1, p2, p3,
                    ncol = 1, nrow = 3)

For the next three columns (view, condition, and waterfront), there was a modest correlation between view and price (0.395) and a weaker one between waterfront and price (0.273). What is notable here is that the condition and price of a house had nearly zero correlation (0.015). However, we should be careful when interpreting these figures: a correlation near zero does not necessarily mean two variables are unrelated, and a high correlation does not imply causation in either direction.

house_2 <- train %>% dplyr::select(waterfront, view, condition, price)
ggpairs(house_2)
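
To illustrate why a near-zero correlation does not rule out a relationship, here is a small sketch on made-up numbers (unrelated to the housing data): y is completely determined by x, yet the Pearson correlation is zero because the relationship is not linear.

```r
# Made-up example: y depends on x exactly, but not linearly
x <- -5:5
y <- x^2

r <- cor(x, y)  # Pearson correlation only measures linear association
print(r)
```

The symmetric U-shape cancels out, so cor() reports no linear association even though y can be predicted perfectly from x.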

According to the boxplots, houses on the waterfront and houses with a good view tend to be more expensive. However, the condition of a house was not a crucial factor.

p4 <- ggplot(train, aes(x = as.factor(waterfront), y = price, fill = as.factor(waterfront))) +
  geom_boxplot() +
  labs(x = "Waterfront", y = "Price", title = "Price by with / without waterfront", fill = "Waterfront")

p5 <- ggplot(train, aes(x = as.factor(view), y = price, fill = as.factor(view))) +
  geom_boxplot() +
  labs(x = "View", y = "Price", title = "Price by View", fill = "View")

p6 <- ggplot(train, aes(x = as.factor(condition), y = price, fill = as.factor(condition))) +
  geom_boxplot() +
  labs(x = "Condition", y = "Price", title = "Price by Condition", fill = "Condition")

ggarrange(p4, p5, p6, ncol = 1, nrow = 3)

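It turns out that, among the next three columns (grade, yr_built, and renovated), grade had a notably high correlation with price. In contrast, yr_built showed only a weak correlation with price on its own: as a single predictor it accounts for well under 1% of the variation in price in the first step of the selection output in the Modeling section. The renovated column also had a low correlation with price.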

house_3 <- train %>% dplyr::select(grade, yr_built, renovated, price)
ggpairs(house_3)

According to the boxplots, price increases gradually across the grade categories. Also, more recently built houses tend to have slightly higher prices. However, there was no big difference in price between renovated and unrenovated houses.

p7 <- ggplot(train, aes(x = as.factor(grade), y = price, fill = as.factor(grade))) +
  geom_boxplot() +
  labs(x = "Grade", y = "Price", title = "Price by Grade", fill = "Grade")

p8 <- ggplot(train, aes(x = as.factor(yr_built), y = price, fill = as.factor(yr_built))) +
  geom_boxplot() +
  labs(x = "Year built", y = "Price", title = "Price by Year built", fill = "Year built")

p9 <- ggplot(train, aes(x = as.factor(renovated), y = price, fill = as.factor(renovated))) +
  geom_boxplot() +
  labs(x = "Renovated", y = "Price", title = "Price by Renovated", fill = "Renovated")

ggarrange(p7, p8, p9, ncol = 1, nrow = 3)

Looking at the sqft_lot and sqft_living columns, sqft_living had a fairly high correlation with price: houses with more living area generally had higher prices. However, sqft_lot was not highly correlated with price.

house_4 <- train %>% dplyr::select(sqft_living, sqft_lot, price)
ggpairs(house_4)

Accordingly, the fitted line for sqft_living was fairly steep, while that for sqft_lot was much flatter.

p10 <- ggplot(train, aes(x = sqft_living, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sqft Living", y = "Price", title = "A Scatterplot of Sqft Living vs Price") 

p11 <- ggplot(train, aes(x = sqft_lot, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sqft Lot", y = "Price", title = "A Scatterplot of Sqft Lot vs Price")


ggarrange(p10, p11, ncol = 1, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

In line with the sqft columns above, sqft_living15 also had a notable correlation with price, while sqft_lot15 did not.

house_5 <- train %>% dplyr::select(sqft_living15, sqft_lot15, price)
ggpairs(house_5)

Likewise, the fitted line for sqft_living15 was steeper than that for sqft_lot15.

p12 <- ggplot(train, aes(x = sqft_living15, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sqft Living15", y = "Price", title = "A Scatterplot of Sqft Living 15 vs Price") 

p13 <- ggplot(train, aes(x = sqft_lot15, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sqft Lot15", y = "Price", title = "A Scatterplot of Sqft Lot 15 vs Price")


ggarrange(p12, p13, ncol = 1, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Beyond the direct correlation with price, a heatmap lets us check the correlation among all columns. The redder a square is, the more highly correlated the two columns are. In the bottom right, the predictors bedrooms, bathrooms, sqft_living, grade, and sqft_living15 appear to be fairly strongly correlated with each other.

mydata.cor <- cor(house)

palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = mydata.cor, col = palette, symm = TRUE, main = "A Heatmap of All Columns")
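
A heatmap gives a visual impression. To read off the strongly correlated pairs by name, one option is to filter the correlation matrix directly. The sketch below does this on a made-up data frame `toy`. The 0.8 cutoff is an arbitrary assumption for illustration, not a threshold used in this report.

```r
set.seed(42)

# Made-up data: b is almost a copy of a, c is independent noise
a   <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

cm <- cor(toy)

# Keep each pair once (upper triangle) and only those above the cutoff
hits <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]],
  r    = cm[hits]
)
print(strong)
```

Applied to mydata.cor, the same filter would list the column pairs that the red squares point to.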

4. Modeling

Because our dataset has a large number of predictors, we can first screen for useful ones with automated search procedures. Let’s implement stepwise regression, forward selection, and backward elimination to choose predictors.

< Stepwise Regression >

regnull <- lm(price ~ 1, data = train)
regfull <- lm(price ~ ., data = train)
step(regnull, scope = list(lower = regnull, upper = regfull), direction = "both")
## Start:  AIC=442775.7
## price ~ 1
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living    1 1.1329e+15 1.1553e+15 430962
## + grade          1 1.0242e+15 1.2640e+15 432516
## + sqft_living15  1 7.8905e+14 1.4992e+15 435466
## + bathrooms      1 6.0414e+14 1.6841e+15 437477
## + view           1 3.5639e+14 1.9318e+15 439850
## + bedrooms       1 2.0920e+14 2.0790e+15 441120
## + waterfront     1 1.7097e+14 2.1172e+15 441435
## + floors         1 1.6137e+14 2.1268e+15 441513
## + renovated      1 3.5656e+13 2.2525e+15 442506
## + sqft_lot       1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15     1 1.5906e+13 2.2723e+15 442657
## + yr_built       1 5.6435e+12 2.2826e+15 442735
## + condition      1 3.4753e+12 2.2847e+15 442751
## <none>                        2.2882e+15 442776
## 
## Step:  AIC=430962
## price ~ sqft_living
## 
##                 Df  Sum of Sq        RSS    AIC
## + view           1 9.6282e+13 1.0590e+15 429459
## + grade          1 9.6101e+13 1.0592e+15 429462
## + waterfront     1 9.0018e+13 1.0653e+15 429561
## + yr_built       1 6.8189e+13 1.0871e+15 429912
## + bedrooms       1 3.3062e+13 1.1223e+15 430462
## + renovated      1 1.6775e+13 1.1386e+15 430711
## + sqft_living15  1 1.6529e+13 1.1388e+15 430715
## + condition      1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15     1 6.0106e+12 1.1493e+15 430874
## + sqft_lot       1 3.2768e+12 1.1520e+15 430915
## + bathrooms      1 2.3654e+12 1.1530e+15 430929
## + floors         1 3.1999e+11 1.1550e+15 430959
## <none>                        1.1553e+15 430962
## - sqft_living    1 1.1329e+15 2.2882e+15 442776
## 
## Step:  AIC=429459.5
## price ~ sqft_living + view
## 
##                 Df  Sum of Sq        RSS    AIC
## + grade          1 8.4847e+13 9.7420e+14 428018
## + yr_built       1 4.6244e+13 1.0128e+15 428690
## + waterfront     1 3.8196e+13 1.0208e+15 428826
## + bedrooms       1 2.1640e+13 1.0374e+15 429105
## + renovated      1 1.0244e+13 1.0488e+15 429293
## + condition      1 9.3046e+12 1.0497e+15 429309
## + sqft_living15  1 9.2323e+12 1.0498e+15 429310
## + sqft_lot15     1 6.9146e+12 1.0521e+15 429348
## + sqft_lot       1 3.9236e+12 1.0551e+15 429397
## + bathrooms      1 2.3140e+12 1.0567e+15 429424
## + floors         1 1.7448e+12 1.0573e+15 429433
## <none>                        1.0590e+15 429459
## - view           1 9.6282e+13 1.1553e+15 430962
## - sqft_living    1 8.7277e+14 1.9318e+15 439850
## 
## Step:  AIC=428017.6
## price ~ sqft_living + view + grade
## 
##                 Df  Sum of Sq        RSS    AIC
## + yr_built       1 1.1066e+14 8.6354e+14 425935
## + waterfront     1 4.0072e+13 9.3412e+14 427293
## + condition      1 2.0864e+13 9.5333e+14 427645
## + renovated      1 1.3455e+13 9.6074e+14 427779
## + bedrooms       1 1.1057e+13 9.6314e+14 427822
## + sqft_lot15     1 5.0887e+12 9.6911e+14 427929
## + floors         1 2.7444e+12 9.7145e+14 427971
## + sqft_lot       1 2.6566e+12 9.7154e+14 427972
## <none>                        9.7420e+14 428018
## + bathrooms      1 1.1035e+11 9.7408e+14 428018
## + sqft_living15  1 1.4925e+09 9.7419e+14 428020
## - grade          1 8.4847e+13 1.0590e+15 429459
## - view           1 8.5029e+13 1.0592e+15 429462
## - sqft_living    1 1.6553e+14 1.1397e+15 430729
## 
## Step:  AIC=425934.8
## price ~ sqft_living + view + grade + yr_built
## 
##                 Df  Sum of Sq        RSS    AIC
## + waterfront     1 4.0818e+13 8.2272e+14 425100
## + bedrooms       1 1.1869e+13 8.5167e+14 425698
## + bathrooms      1 6.1747e+12 8.5736e+14 425813
## + floors         1 4.8417e+12 8.5869e+14 425840
## + sqft_lot15     1 3.7083e+12 8.5983e+14 425862
## + sqft_lot       1 2.1378e+12 8.6140e+14 425894
## + renovated      1 1.5630e+12 8.6197e+14 425906
## + condition      1 1.5122e+12 8.6202e+14 425907
## + sqft_living15  1 3.1983e+11 8.6322e+14 425930
## <none>                        8.6354e+14 425935
## - view           1 5.0196e+13 9.1373e+14 426910
## - yr_built       1 1.1066e+14 9.7420e+14 428018
## - grade          1 1.4926e+14 1.0128e+15 428690
## - sqft_living    1 1.6156e+14 1.0251e+15 428898
## 
## Step:  AIC=425099.6
## price ~ sqft_living + view + grade + yr_built + waterfront
## 
##                 Df  Sum of Sq        RSS    AIC
## + bedrooms       1 9.8179e+12 8.1290e+14 424894
## + bathrooms      1 6.7148e+12 8.1600e+14 424960
## + floors         1 4.1674e+12 8.1855e+14 425014
## + sqft_lot15     1 3.8603e+12 8.1886e+14 425020
## + sqft_lot       1 1.9265e+12 8.2079e+14 425061
## + condition      1 1.5979e+12 8.2112e+14 425068
## + renovated      1 7.6988e+11 8.2195e+14 425085
## + sqft_living15  1 6.2102e+11 8.2210e+14 425089
## <none>                        8.2272e+14 425100
## - view           1 1.6942e+13 8.3966e+14 425450
## - waterfront     1 4.0818e+13 8.6354e+14 425935
## - yr_built       1 1.1141e+14 9.3412e+14 427293
## - grade          1 1.5190e+14 9.7462e+14 428027
## - sqft_living    1 1.6005e+14 9.8276e+14 428171
## 
## Step:  AIC=424894
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + bathrooms      1 1.0157e+13 8.0274e+14 424679
## + sqft_lot15     1 5.2950e+12 8.0760e+14 424783
## + floors         1 4.2569e+12 8.0864e+14 424805
## + sqft_lot       1 2.8889e+12 8.1001e+14 424834
## + condition      1 2.1432e+12 8.1076e+14 424850
## + renovated      1 7.1013e+11 8.1219e+14 424881
## + sqft_living15  1 4.9966e+11 8.1240e+14 424885
## <none>                        8.1290e+14 424894
## - bedrooms       1 9.8179e+12 8.2272e+14 425100
## - view           1 1.4821e+13 8.2772e+14 425204
## - waterfront     1 3.8767e+13 8.5167e+14 425698
## - yr_built       1 1.1213e+14 9.2503e+14 427126
## - grade          1 1.3878e+14 9.5168e+14 427617
## - sqft_living    1 1.5617e+14 9.6907e+14 427930
## 
## Step:  AIC=424678.6
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot15     1 4.7365e+12 7.9801e+14 424578
## + floors         1 3.2782e+12 7.9946e+14 424610
## + sqft_lot       1 2.6152e+12 8.0013e+14 424624
## + condition      1 1.7809e+12 8.0096e+14 424642
## + sqft_living15  1 1.3301e+12 8.0141e+14 424652
## + renovated      1 2.6833e+11 8.0247e+14 424675
## <none>                        8.0274e+14 424679
## - bathrooms      1 1.0157e+13 8.1290e+14 424894
## - bedrooms       1 1.3260e+13 8.1600e+14 424960
## - view           1 1.3572e+13 8.1631e+14 424967
## - waterfront     1 3.9085e+13 8.4183e+14 425499
## - sqft_living    1 1.1018e+14 9.1292e+14 426900
## - yr_built       1 1.2138e+14 9.2412e+14 427111
## - grade          1 1.3191e+14 9.3465e+14 427307
## 
## Step:  AIC=424578.3
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15
## 
##                 Df  Sum of Sq        RSS    AIC
## + floors         1 2.6691e+12 7.9534e+14 424522
## + condition      1 1.8839e+12 7.9612e+14 424539
## + sqft_living15  1 1.7001e+12 7.9631e+14 424543
## + renovated      1 2.8838e+11 7.9772e+14 424574
## <none>                        7.9801e+14 424578
## + sqft_lot       1 1.7180e+09 7.9800e+14 424580
## - sqft_lot15     1 4.7365e+12 8.0274e+14 424679
## - bathrooms      1 9.5989e+12 8.0760e+14 424783
## - view           1 1.3819e+13 8.1182e+14 424873
## - bedrooms       1 1.4684e+13 8.1269e+14 424892
## - waterfront     1 3.9104e+13 8.3711e+14 425403
## - sqft_living    1 1.1487e+14 9.1287e+14 426902
## - yr_built       1 1.1940e+14 9.1740e+14 426987
## - grade          1 1.2849e+14 9.2649e+14 427158
## 
## Step:  AIC=424522.4
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors
## 
##                 Df  Sum of Sq        RSS    AIC
## + condition      1 2.5106e+12 7.9283e+14 424470
## + sqft_living15  1 2.1413e+12 7.9319e+14 424478
## + renovated      1 1.5045e+11 7.9519e+14 424521
## <none>                        7.9534e+14 424522
## + sqft_lot       1 1.3285e+08 7.9534e+14 424524
## - floors         1 2.6691e+12 7.9801e+14 424578
## - sqft_lot15     1 4.1274e+12 7.9946e+14 424610
## - bathrooms      1 8.7689e+12 8.0410e+14 424710
## - view           1 1.4339e+13 8.0968e+14 424829
## - bedrooms       1 1.4489e+13 8.0982e+14 424833
## - waterfront     1 3.8544e+13 8.3388e+14 425339
## - sqft_living    1 1.1435e+14 9.0969e+14 426843
## - grade          1 1.1700e+14 9.1233e+14 426893
## - yr_built       1 1.1745e+14 9.1279e+14 426902
## 
## Step:  AIC=424469.7
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors + condition
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living15  1 2.2608e+12 7.9056e+14 424422
## + renovated      1 4.1564e+11 7.9241e+14 424463
## <none>                        7.9283e+14 424470
## + sqft_lot       1 1.0595e+08 7.9283e+14 424472
## - condition      1 2.5106e+12 7.9534e+14 424522
## - floors         1 3.2958e+12 7.9612e+14 424539
## - sqft_lot15     1 4.1731e+12 7.9700e+14 424558
## - bathrooms      1 8.2704e+12 8.0110e+14 424647
## - view           1 1.4194e+13 8.0702e+14 424775
## - bedrooms       1 1.5113e+13 8.0794e+14 424794
## - waterfront     1 3.8519e+13 8.3134e+14 425288
## - yr_built       1 9.9806e+13 8.9263e+14 426518
## - sqft_living    1 1.1395e+14 9.0677e+14 426790
## - grade          1 1.1745e+14 9.1027e+14 426856
## 
## Step:  AIC=424422.3
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors + condition + 
##     sqft_living15
## 
##                 Df  Sum of Sq        RSS    AIC
## + renovated      1 4.9482e+11 7.9007e+14 424414
## <none>                        7.9056e+14 424422
## + sqft_lot       1 4.5084e+09 7.9056e+14 424424
## - sqft_living15  1 2.2608e+12 7.9283e+14 424470
## - condition      1 2.6301e+12 7.9319e+14 424478
## - floors         1 3.8081e+12 7.9437e+14 424503
## - sqft_lot15     1 4.5355e+12 7.9510e+14 424519
## - bathrooms      1 9.2553e+12 7.9982e+14 424622
## - view           1 1.2869e+13 8.0343e+14 424700
## - bedrooms       1 1.5167e+13 8.0573e+14 424749
## - waterfront     1 3.9132e+13 8.2970e+14 425256
## - sqft_living    1 8.2208e+13 8.7277e+14 426131
## - grade          1 9.6291e+13 8.8686e+14 426408
## - yr_built       1 1.0173e+14 8.9229e+14 426513
## 
## Step:  AIC=424413.5
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors + condition + 
##     sqft_living15 + renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        7.9007e+14 424414
## + sqft_lot       1 4.8683e+09 7.9006e+14 424415
## - renovated      1 4.9482e+11 7.9056e+14 424422
## - sqft_living15  1 2.3400e+12 7.9241e+14 424463
## - condition      1 2.9329e+12 7.9300e+14 424476
## - floors         1 3.6017e+12 7.9367e+14 424490
## - sqft_lot15     1 4.5943e+12 7.9466e+14 424512
## - bathrooms      1 8.7444e+12 7.9881e+14 424602
## - view           1 1.2706e+13 8.0278e+14 424687
## - bedrooms       1 1.5062e+13 8.0513e+14 424738
## - waterfront     1 3.8512e+13 8.2858e+14 425234
## - sqft_living    1 8.1831e+13 8.7190e+14 426116
## - yr_built       1 8.9033e+13 8.7910e+14 426258
## - grade          1 9.6056e+13 8.8613e+14 426395
## 
## Call:
## lm(formula = price ~ sqft_living + view + grade + yr_built + 
##     waterfront + bedrooms + bathrooms + sqft_lot15 + floors + 
##     condition + sqft_living15 + renovated, data = train)
## 
## Coefficients:
##   (Intercept)    sqft_living           view          grade       yr_built  
##    -6.143e+05      1.635e+02      4.105e+04      1.138e+05     -6.491e+04  
##    waterfront       bedrooms      bathrooms     sqft_lot15         floors  
##     6.004e+05     -4.016e+04      4.429e+04     -6.069e-01      3.309e+04  
##     condition  sqft_living15      renovated  
##     2.183e+04      2.810e+01      2.843e+04
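
The AIC column in the trace can be reproduced by hand. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * p, where n is the number of observations and p is the number of estimated coefficients. A minimal sketch on R's built-in mtcars data (a stand-in chosen for illustration, not the housing data):

```r
# Stand-in model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)  # residual sum of squares
p   <- length(coef(fit))      # intercept + 2 slopes

manual_aic <- n * log(rss / n) + 2 * p
step_aic   <- extractAIC(fit)[2]  # second element is the AIC used by step()

print(c(manual = manual_aic, extractAIC = step_aic))
```

Plugging in the intercept-only row of the trace (n = 17290 training rows, RSS = 2.2882e+15, p = 1) gives 17290 * log(2.2882e15 / 17290) + 2, which is about 442775.7 and matches the starting AIC. Note that this scale differs from AIC() by an additive constant, so only differences between models are meaningful.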

< Forward Selection >

step(regnull, scope=list(lower=regnull, upper=regfull), direction="forward")
## Start:  AIC=442775.7
## price ~ 1
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living    1 1.1329e+15 1.1553e+15 430962
## + grade          1 1.0242e+15 1.2640e+15 432516
## + sqft_living15  1 7.8905e+14 1.4992e+15 435466
## + bathrooms      1 6.0414e+14 1.6841e+15 437477
## + view           1 3.5639e+14 1.9318e+15 439850
## + bedrooms       1 2.0920e+14 2.0790e+15 441120
## + waterfront     1 1.7097e+14 2.1172e+15 441435
## + floors         1 1.6137e+14 2.1268e+15 441513
## + renovated      1 3.5656e+13 2.2525e+15 442506
## + sqft_lot       1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15     1 1.5906e+13 2.2723e+15 442657
## + yr_built       1 5.6435e+12 2.2826e+15 442735
## + condition      1 3.4753e+12 2.2847e+15 442751
## <none>                        2.2882e+15 442776
## 
## Step:  AIC=430962
## price ~ sqft_living
## 
##                 Df  Sum of Sq        RSS    AIC
## + view           1 9.6282e+13 1.0590e+15 429459
## + grade          1 9.6101e+13 1.0592e+15 429462
## + waterfront     1 9.0018e+13 1.0653e+15 429561
## + yr_built       1 6.8189e+13 1.0871e+15 429912
## + bedrooms       1 3.3062e+13 1.1223e+15 430462
## + renovated      1 1.6775e+13 1.1386e+15 430711
## + sqft_living15  1 1.6529e+13 1.1388e+15 430715
## + condition      1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15     1 6.0106e+12 1.1493e+15 430874
## + sqft_lot       1 3.2768e+12 1.1520e+15 430915
## + bathrooms      1 2.3654e+12 1.1530e+15 430929
## + floors         1 3.1999e+11 1.1550e+15 430959
## <none>                        1.1553e+15 430962
## 
## Step:  AIC=429459.5
## price ~ sqft_living + view
## 
##                 Df  Sum of Sq        RSS    AIC
## + grade          1 8.4847e+13 9.7420e+14 428018
## + yr_built       1 4.6244e+13 1.0128e+15 428690
## + waterfront     1 3.8196e+13 1.0208e+15 428826
## + bedrooms       1 2.1640e+13 1.0374e+15 429105
## + renovated      1 1.0244e+13 1.0488e+15 429293
## + condition      1 9.3046e+12 1.0497e+15 429309
## + sqft_living15  1 9.2323e+12 1.0498e+15 429310
## + sqft_lot15     1 6.9146e+12 1.0521e+15 429348
## + sqft_lot       1 3.9236e+12 1.0551e+15 429397
## + bathrooms      1 2.3140e+12 1.0567e+15 429424
## + floors         1 1.7448e+12 1.0573e+15 429433
## <none>                        1.0590e+15 429459
## 
## Step:  AIC=428017.6
## price ~ sqft_living + view + grade
## 
##                 Df  Sum of Sq        RSS    AIC
## + yr_built       1 1.1066e+14 8.6354e+14 425935
## + waterfront     1 4.0072e+13 9.3412e+14 427293
## + condition      1 2.0864e+13 9.5333e+14 427645
## + renovated      1 1.3455e+13 9.6074e+14 427779
## + bedrooms       1 1.1057e+13 9.6314e+14 427822
## + sqft_lot15     1 5.0887e+12 9.6911e+14 427929
## + floors         1 2.7444e+12 9.7145e+14 427971
## + sqft_lot       1 2.6566e+12 9.7154e+14 427972
## <none>                        9.7420e+14 428018
## + bathrooms      1 1.1035e+11 9.7408e+14 428018
## + sqft_living15  1 1.4925e+09 9.7419e+14 428020
## 
## Step:  AIC=425934.8
## price ~ sqft_living + view + grade + yr_built
## 
##                 Df  Sum of Sq        RSS    AIC
## + waterfront     1 4.0818e+13 8.2272e+14 425100
## + bedrooms       1 1.1869e+13 8.5167e+14 425698
## + bathrooms      1 6.1747e+12 8.5736e+14 425813
## + floors         1 4.8417e+12 8.5869e+14 425840
## + sqft_lot15     1 3.7083e+12 8.5983e+14 425862
## + sqft_lot       1 2.1378e+12 8.6140e+14 425894
## + renovated      1 1.5630e+12 8.6197e+14 425906
## + condition      1 1.5122e+12 8.6202e+14 425907
## + sqft_living15  1 3.1983e+11 8.6322e+14 425930
## <none>                        8.6354e+14 425935
## 
## Step:  AIC=425099.6
## price ~ sqft_living + view + grade + yr_built + waterfront
## 
##                 Df  Sum of Sq        RSS    AIC
## + bedrooms       1 9.8179e+12 8.1290e+14 424894
## + bathrooms      1 6.7148e+12 8.1600e+14 424960
## + floors         1 4.1674e+12 8.1855e+14 425014
## + sqft_lot15     1 3.8603e+12 8.1886e+14 425020
## + sqft_lot       1 1.9265e+12 8.2079e+14 425061
## + condition      1 1.5979e+12 8.2112e+14 425068
## + renovated      1 7.6988e+11 8.2195e+14 425085
## + sqft_living15  1 6.2102e+11 8.2210e+14 425089
## <none>                        8.2272e+14 425100
## 
## Step:  AIC=424894
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + bathrooms      1 1.0157e+13 8.0274e+14 424679
## + sqft_lot15     1 5.2950e+12 8.0760e+14 424783
## + floors         1 4.2569e+12 8.0864e+14 424805
## + sqft_lot       1 2.8889e+12 8.1001e+14 424834
## + condition      1 2.1432e+12 8.1076e+14 424850
## + renovated      1 7.1013e+11 8.1219e+14 424881
## + sqft_living15  1 4.9966e+11 8.1240e+14 424885
## <none>                        8.1290e+14 424894
## 
## Step:  AIC=424678.6
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot15     1 4.7365e+12 7.9801e+14 424578
## + floors         1 3.2782e+12 7.9946e+14 424610
## + sqft_lot       1 2.6152e+12 8.0013e+14 424624
## + condition      1 1.7809e+12 8.0096e+14 424642
## + sqft_living15  1 1.3301e+12 8.0141e+14 424652
## + renovated      1 2.6833e+11 8.0247e+14 424675
## <none>                        8.0274e+14 424679
## 
## Step:  AIC=424578.3
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15
## 
##                 Df  Sum of Sq        RSS    AIC
## + floors         1 2.6691e+12 7.9534e+14 424522
## + condition      1 1.8839e+12 7.9612e+14 424539
## + sqft_living15  1 1.7001e+12 7.9631e+14 424543
## + renovated      1 2.8838e+11 7.9772e+14 424574
## <none>                        7.9801e+14 424578
## + sqft_lot       1 1.7180e+09 7.9800e+14 424580
## 
## Step:  AIC=424522.4
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors
## 
##                 Df  Sum of Sq        RSS    AIC
## + condition      1 2.5106e+12 7.9283e+14 424470
## + sqft_living15  1 2.1413e+12 7.9319e+14 424478
## + renovated      1 1.5045e+11 7.9519e+14 424521
## <none>                        7.9534e+14 424522
## + sqft_lot       1 1.3285e+08 7.9534e+14 424524
## 
## Step:  AIC=424469.7
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors + condition
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living15  1 2.2608e+12 7.9056e+14 424422
## + renovated      1 4.1564e+11 7.9241e+14 424463
## <none>                        7.9283e+14 424470
## + sqft_lot       1 1.0595e+08 7.9283e+14 424472
## 
## Step:  AIC=424422.3
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors + condition + 
##     sqft_living15
## 
##             Df  Sum of Sq        RSS    AIC
## + renovated  1 4.9482e+11 7.9007e+14 424414
## <none>                    7.9056e+14 424422
## + sqft_lot   1 4.5084e+09 7.9056e+14 424424
## 
## Step:  AIC=424413.5
## price ~ sqft_living + view + grade + yr_built + waterfront + 
##     bedrooms + bathrooms + sqft_lot15 + floors + condition + 
##     sqft_living15 + renovated
## 
##            Df  Sum of Sq        RSS    AIC
## <none>                   7.9007e+14 424414
## + sqft_lot  1 4868282009 7.9006e+14 424415
## 
## Call:
## lm(formula = price ~ sqft_living + view + grade + yr_built + 
##     waterfront + bedrooms + bathrooms + sqft_lot15 + floors + 
##     condition + sqft_living15 + renovated, data = train)
## 
## Coefficients:
##   (Intercept)    sqft_living           view          grade       yr_built  
##    -6.143e+05      1.635e+02      4.105e+04      1.138e+05     -6.491e+04  
##    waterfront       bedrooms      bathrooms     sqft_lot15         floors  
##     6.004e+05     -4.016e+04      4.429e+04     -6.069e-01      3.309e+04  
##     condition  sqft_living15      renovated  
##     2.183e+04      2.810e+01      2.843e+04

< Backward Elimination >

step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
## Start:  AIC=424415.4
## price ~ bedrooms + bathrooms + floors + waterfront + view + condition + 
##     grade + yr_built + sqft_living + sqft_lot + sqft_living15 + 
##     sqft_lot15 + renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## - sqft_lot       1 4.8683e+09 7.9007e+14 424414
## <none>                        7.9006e+14 424415
## - renovated      1 4.9518e+11 7.9056e+14 424424
## - sqft_lot15     1 2.3061e+12 7.9237e+14 424464
## - sqft_living15  1 2.3447e+12 7.9241e+14 424465
## - condition      1 2.9358e+12 7.9300e+14 424478
## - floors         1 3.6059e+12 7.9367e+14 424492
## - bathrooms      1 8.7436e+12 7.9881e+14 424604
## - view           1 1.2697e+13 8.0276e+14 424689
## - bedrooms       1 1.5031e+13 8.0510e+14 424739
## - waterfront     1 3.8509e+13 8.2857e+14 425236
## - sqft_living    1 8.1390e+13 8.7146e+14 426109
## - yr_built       1 8.9023e+13 8.7909e+14 426259
## - grade          1 9.6051e+13 8.8612e+14 426397
## 
## Step:  AIC=424413.5
## price ~ bedrooms + bathrooms + floors + waterfront + view + condition + 
##     grade + yr_built + sqft_living + sqft_living15 + sqft_lot15 + 
##     renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        7.9007e+14 424414
## - renovated      1 4.9482e+11 7.9056e+14 424422
## - sqft_living15  1 2.3400e+12 7.9241e+14 424463
## - condition      1 2.9329e+12 7.9300e+14 424476
## - floors         1 3.6017e+12 7.9367e+14 424490
## - sqft_lot15     1 4.5943e+12 7.9466e+14 424512
## - bathrooms      1 8.7444e+12 7.9881e+14 424602
## - view           1 1.2706e+13 8.0278e+14 424687
## - bedrooms       1 1.5062e+13 8.0513e+14 424738
## - waterfront     1 3.8512e+13 8.2858e+14 425234
## - sqft_living    1 8.1831e+13 8.7190e+14 426116
## - yr_built       1 8.9033e+13 8.7910e+14 426258
## - grade          1 9.6056e+13 8.8613e+14 426395
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     view + condition + grade + yr_built + sqft_living + sqft_living15 + 
##     sqft_lot15 + renovated, data = train)
## 
## Coefficients:
##   (Intercept)       bedrooms      bathrooms         floors     waterfront  
##    -6.143e+05     -4.016e+04      4.429e+04      3.309e+04      6.004e+05  
##          view      condition          grade       yr_built    sqft_living  
##     4.105e+04      2.183e+04      1.138e+05     -6.491e+04      1.635e+02  
## sqft_living15     sqft_lot15      renovated  
##     2.810e+01     -6.069e-01      2.843e+04

As a result, every predictor except sqft_lot was selected, and backward elimination ends at the same 12-predictor model as the preceding search. We can use the remaining predictors to build our model.
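
The AIC values printed by step() come from extractAIC(), which for a linear model equals n log(RSS/n) + 2k, where k is the number of estimated coefficients. The following is a minimal sketch of that correspondence; it uses the built-in mtcars data and an illustrative formula, both of which are assumptions and not part of the housing analysis.

```r
# Illustrative model on a built-in dataset (an assumption; not the housing model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
k   <- length(coef(fit))          # estimated coefficients, intercept included
rss <- sum(residuals(fit)^2)      # residual sum of squares

aic_by_hand <- n * log(rss / n) + 2 * k
aic_step    <- extractAIC(fit)[2] # the value step() reports as "AIC"

all.equal(aic_by_hand, aic_step)
```

Applied to the final model above (n = 17290, RSS = 7.9007e+14, k = 13), the same formula gives roughly 424413.5, in line with the AIC reported in the last step.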

< Linear Regression >

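Using the lm function, let’s fit our first model. According to the summary, every predictor is statistically significant. What is counterintuitive is that the coefficients of bedrooms and sqft_lot15 are negative. One possible reading is that each coefficient is a partial effect with the other predictors held fixed: for a given sqft_living, for example, an additional bedroom implies smaller rooms.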

PRESS <- function(linear.model) {
  # leave-one-out (PRESS) residuals: ordinary residuals divided by (1 - leverage),
  # with the leverages taken from lm.influence(linear.model)$hat
  pr <- residuals(linear.model) / (1 - lm.influence(linear.model)$hat)
  # PRESS is the sum of the squared leave-one-out residuals
  PRESS <- sum(pr ^ 2)
  return(PRESS)
}
result <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living + 
               sqft_living15 + sqft_lot15 + renovated, data = train)
summary(result)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     view + condition + grade + yr_built + sqft_living + sqft_living15 + 
##     sqft_lot15 + renovated, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1231890  -111153    -8250    91144  4324636 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.143e+05  1.798e+04 -34.169  < 2e-16 ***
## bedrooms      -4.016e+04  2.213e+03 -18.149  < 2e-16 ***
## bathrooms      4.429e+04  3.203e+03  13.828  < 2e-16 ***
## floors         3.309e+04  3.729e+03   8.875  < 2e-16 ***
## waterfront     6.004e+05  2.069e+04  29.020  < 2e-16 ***
## view           4.105e+04  2.462e+03  16.669  < 2e-16 ***
## condition      2.183e+04  2.726e+03   8.008 1.24e-15 ***
## grade          1.138e+05  2.482e+03  45.831  < 2e-16 ***
## yr_built      -6.491e+04  1.471e+03 -44.124  < 2e-16 ***
## sqft_living    1.635e+02  3.865e+00  42.302  < 2e-16 ***
## sqft_living15  2.810e+01  3.929e+00   7.153 8.81e-13 ***
## sqft_lot15    -6.069e-01  6.055e-02 -10.023  < 2e-16 ***
## renovated      2.843e+04  8.642e+03   3.289  0.00101 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 213800 on 17277 degrees of freedom
## Multiple R-squared:  0.6547, Adjusted R-squared:  0.6545 
## F-statistic:  2730 on 12 and 17277 DF,  p-value: < 2.2e-16
test$predict <- round(predict(result, newdata = test))

test_mse_ln <- mean((test$price - test$predict)^2)
test_mse_ln
## [1] 52482587023
summary(result)$r.squared
## [1] 0.6547201
summary(result)$adj.r.squared
## [1] 0.6544803
PRESS(result)
## [1] 7.947446e+14
# Find SST
anova_result <- anova(result)
SST <- sum(anova_result$"Sum Sq")
# R2 pred
Rsq_pred <- 1 - PRESS(result) / SST
Rsq_pred
## [1] 0.6526771
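
For reference, the two quantities computed by PRESS() and Rsq_pred can be written as below. The symbols are notation introduced here for illustration: \(e_i\) is the i-th ordinary residual, \(h_{ii}\) is the i-th leverage (diagonal of the hat matrix), and SST is the total sum of squares, obtained in the code by summing the Sum Sq column of the ANOVA table.

\[
\mathrm{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^{2},
\qquad
R^2_{\text{pred}} = 1 - \frac{\mathrm{PRESS}}{\mathrm{SST}}
\]

With PRESS = 7.947446e+14 and SST = RSS / (1 - R-squared), which is approximately 2.288e+15 here, this gives about 0.6527, consistent with the printed value.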

All VIFs are well below the usual threshold of 10 (the largest is 4.73, for sqft_living), so there is no sign of serious multicollinearity in our model.

vif(result)
##      bedrooms     bathrooms        floors    waterfront          view 
##      1.607513      2.207145      1.532359      1.198319      1.353198 
##     condition         grade      yr_built   sqft_living sqft_living15 
##      1.209302      3.208107      1.793612      4.731896      2.740061 
##    sqft_lot15     renovated 
##      1.065314      1.126851
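
To make the VIF values concrete: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes this by hand on the built-in mtcars data (an assumption chosen so the example is self-contained; the predictor names are not from the housing data) and checks it against the diagonal of the inverse correlation matrix.

```r
# Illustrative predictors from a built-in dataset (an assumption; not the housing data).
X <- mtcars[, c("wt", "hp", "disp")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif_by_hand <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_from_cor <- diag(solve(cor(X)))

round(vif_by_hand, 3)
```

A VIF of 1 means the predictor is uncorrelated with the others; values above 10 are the usual warning sign referred to in the text.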

Let’s check the residual plot to verify the regression assumptions. The first plot suggests that the second assumption, constant variance, is violated: the spread of the residuals grows as the fitted value gets larger. Therefore, we should transform the response y to address this issue.

yhat <- result$fitted.values
res <- result$residuals
Data <- data.frame(train, yhat, res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")

The Box-Cox method is an analytical way to choose a power transformation of the response variable that stabilizes the variance. According to the plot, the optimal \(\lambda\) is about 0.1.

boxcox(result,lambda = seq(-1.,1,0.5))
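
Rather than reading the optimum off the plot, the maximizing value can be extracted from the object that boxcox() returns. This is a minimal sketch on the built-in mtcars data with an illustrative formula (both are assumptions, not the housing model); it relies on boxcox() from the MASS package returning the lambda grid in x and the profile log-likelihood in y.

```r
library(MASS)  # provides boxcox()

# Illustrative model on built-in data (an assumption; not the housing model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Profile log-likelihood over a fine grid of lambda values, without plotting.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# The lambda that maximizes the profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

A finer grid than seq(-1, 1, 0.5) makes the maximizer easier to pin down; in practice a nearby convenient value is often used.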

Therefore, let’s transform the response as \(y^{*} = y^{0.1}\).

train <- train %>% mutate(price = price ^ 0.1)

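After the y-transformation, the model gives better results. R-squared rises slightly, from 0.6547 to 0.6585, and the test MSE, computed on the original price scale after back-transforming the predictions, falls from about 5.25e+10 to about 4.39e+10. Note that R-squared and PRESS are now measured on the transformed scale, so the test MSE is the more direct comparison between the two models.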

result2 <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living + 
               sqft_living15 + sqft_lot15 + renovated, data = train)
summary(result2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     view + condition + grade + yr_built + sqft_living + sqft_living15 + 
##     sqft_lot15 + renovated, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58082 -0.07908  0.00364  0.07785  0.47712 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.911e+00  9.660e-03 301.295  < 2e-16 ***
## bedrooms      -8.982e-03  1.189e-03  -7.555 4.41e-14 ***
## bathrooms      2.592e-02  1.721e-03  15.062  < 2e-16 ***
## floors         3.859e-02  2.004e-03  19.263  < 2e-16 ***
## waterfront     1.443e-01  1.112e-02  12.981  < 2e-16 ***
## view           1.826e-02  1.323e-03  13.802  < 2e-16 ***
## condition      1.828e-02  1.465e-03  12.479  < 2e-16 ***
## grade          7.377e-02  1.334e-03  55.312  < 2e-16 ***
## yr_built      -3.756e-02  7.904e-04 -47.518  < 2e-16 ***
## sqft_living    5.421e-05  2.077e-06  26.104  < 2e-16 ***
## sqft_living15  3.686e-05  2.111e-06  17.463  < 2e-16 ***
## sqft_lot15    -1.960e-07  3.254e-08  -6.025 1.73e-09 ***
## renovated      1.466e-02  4.643e-03   3.158  0.00159 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1149 on 17277 degrees of freedom
## Multiple R-squared:  0.6585, Adjusted R-squared:  0.6583 
## F-statistic:  2776 on 12 and 17277 DF,  p-value: < 2.2e-16
test$predict <- round(predict(result2, newdata = test) ^ 10)
test_mse_ln_2 <- mean((test$price - test$predict)^2)
test_mse_ln_2
## [1] 43928749327
summary(result2)$r.squared
## [1] 0.658523
summary(result2)$adj.r.squared
## [1] 0.6582858
PRESS(result2)
## [1] 228.5181
# Find SST
anova_result <- anova(result2)
SST <- sum(anova_result$"Sum Sq")

# R2 pred
Rsq_pred <- 1 - PRESS(result2) / SST
Rsq_pred
## [1] 0.6578727
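
Because the second model is fitted to price^0.1, its predictions have to be raised to the power 10 before they can be compared with actual prices, which is what the `^ 10` in the code above does. The sketch below shows the same pattern end to end on the built-in mtcars data; the split, the formula, and the object names are assumptions made only for illustration.

```r
# Illustrative split of a built-in dataset (an assumption; not the housing data).
train_idx <- seq(1, nrow(mtcars), by = 2)
train_df  <- mtcars[train_idx, ]
test_df   <- mtcars[-train_idx, ]

lambda <- 0.1  # same power as in the text

# Fit on the transformed response, then undo the transform on the predictions.
fit_t  <- lm(I(mpg^lambda) ~ wt + hp, data = train_df)
pred_t <- predict(fit_t, newdata = test_df)  # on the y^lambda scale
pred   <- pred_t^(1 / lambda)                # back on the original scale

# Test MSE on the original scale, comparable with an untransformed model's test MSE.
mse_original_scale <- mean((test_df$mpg - pred)^2)
mse_original_scale
```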

According to the residual plot, ACF plot, and normal probability plot, all of the regression assumptions are now satisfied.

yhat<-result2$fitted.values 
res<-result2$residuals
Data<-data.frame(train,yhat,res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")

tail(test)
##         price bedrooms bathrooms floors waterfront view condition grade
## 21585  380000        3         2      2          0    0         3     7
## 21592  572000        4         3      2          0    0         3     8
## 21594 1088000        5         4      2          0    2         3    10
## 21599  541800        4         2      2          0    2         3     9
## 21602  467000        3         2      3          0    0         3     8
## 21607 1007500        4         4      2          0    0         3     9
##       yr_built sqft_living sqft_lot sqft_living15 sqft_lot15 renovated predict
## 21585        5        1260      900          1310       1415         0  285799
## 21592        5        2770     3852          1810       5641         0  484241
## 21594        5        4170     8142          3030       7980         0 1113250
## 21599        5        3118     7866          2673       6500         0  692559
## 21602        5        1425     1179          1285       1253         0  400368
## 21607        5        3510     7200          2050       6200         0  717624

4. Identifying Outliers, High Leverage Points, and Influential Points

In order to improve the performance of our model, let’s identify outliers, high leverage points, and influential points in the dataset.

< Outlier Detection >

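Standardized, studentized, and externally studentized residuals can each be used to detect outliers. The formal check flags an observation when its externally studentized residual exceeds, in absolute value, the Bonferroni-adjusted critical value \(t_{1-\alpha/(2n),\, n-1-p}\) with \(\alpha = 0.05\). For this model the cutoff is about 4.7. The smallest residual, -0.58082, is roughly five residual standard errors (0.1149) below zero, so at least that observation exceeds the cutoff and should be examined as a potential outlier.

n <- nrow(train)
p <- 13
# Bonferroni critical value: note the division, 0.05 / (2 * n).
# Passing 2 * n and n - 1 - p as separate arguments would make them the
# degrees of freedom and a noncentrality parameter instead.
cv <- qt(1 - 0.05 / (2 * n), n - 1 - p)
res <- result2$residuals
standard.res <- res / summary(result2)$sigma
student.res <- rstandard(result2)
ext.student.res <- rstudent(result2)

ext.student.res[abs(ext.student.res) > cv]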
res.frame<-data.frame(res,standard.res,
                      student.res,ext.student.res)
par(mfrow=c(1,3))
plot(result2$fitted.values,standard.res,
     main="Standardized Residuals",
     ylim=c(-4.5,4.5))
plot(result2$fitted.values,student.res,
     main="Studentized Residuals",
     ylim=c(-4.5,4.5))
plot(result2$fitted.values,ext.student.res,
     main="Externally  Studentized Residuals",
     ylim=c(-4.5,4.5))
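
The outlier check described in this subsection compares each externally studentized residual with a Bonferroni-adjusted t quantile. A minimal, self-contained sketch of that procedure follows; it uses the built-in mtcars data and an illustrative formula, which are assumptions and not the housing model.

```r
# Illustrative model on built-in data (an assumption; not the housing model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n     <- nrow(mtcars)
p     <- length(coef(fit))   # number of estimated coefficients
alpha <- 0.05

# Bonferroni-adjusted two-sided cutoff: alpha / (2n) in each tail, n - 1 - p df.
cv <- qt(1 - alpha / (2 * n), df = n - 1 - p)

# Externally studentized residuals beyond the cutoff are flagged as outliers.
t_ext   <- rstudent(fit)
flagged <- t_ext[abs(t_ext) > cv]
flagged
```

Naming the df argument makes the call easier to audit, since qt() also accepts a third, noncentrality argument by position.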

< Leverage >

Leverage measures how far an observation's predictor values lie from the centroid of the predictor space; it does not involve the response. According to the result, a total of 1604 observations (about 9.3%) have leverage above the cutoff 2p/n. These high leverage points are potentially influential observations.

# leverage
lev <- lm.influence(result2)$hat

# identify high leverage points
x <- lev[lev > 2 * p / n]
length(x) / n
## [1] 0.09277039
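
The 2p/n cutoff used above comes from the fact that the leverages always sum to p, so the average leverage is p/n and the rule flags anything above twice the average. A short sketch on the built-in mtcars data (an assumption made for self-containment; the formula is illustrative) shows this:

```r
# Illustrative model on built-in data (an assumption; not the housing model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))

lev <- hatvalues(fit)   # same values as lm.influence(fit)$hat

# Leverages sum to p, so their mean is p / n; the rule of thumb flags
# observations with more than twice the average leverage.
sum_lev  <- sum(lev)
high_lev <- lev[lev > 2 * p / n]
names(high_lev)
```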

< Detecting Influential Observations >

After finding the observations that are outlying or have high leverage, the next step is to ascertain whether these observations are influential. Measures of influence address how much the estimates (fitted values, coefficients, etc.) would change if an observation were deleted. By looking at Cook’s Distance, DFFITS (Difference in Fits), and DFBETAS (Difference in Betas), we can detect influential observations.

The result from Cook’s distance shows no sign of influential observations.

# COOK's Distance
COOKS<-cooks.distance(result2)
y <- COOKS[COOKS>qf(0.5,p,n-p)]
length(y)
## [1] 0
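
Cook's distance combines the size of a residual with the leverage of the observation, which is why a point needs both to be flagged. The sketch below reproduces cooks.distance() from those two ingredients on the built-in mtcars data; the dataset and formula are assumptions used only for illustration.

```r
# Illustrative model on built-in data (an assumption; not the housing model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))
h <- hatvalues(fit)
r <- rstandard(fit)   # internally studentized residuals

# Cook's distance: D_i = r_i^2 / p * h_i / (1 - h_i)
cooks_by_hand <- r^2 / p * h / (1 - h)

# Rule of thumb used in the text: compare with the median of F(p, n - p).
cutoff      <- qf(0.5, p, n - p)
influential <- cooks_by_hand[cooks_by_hand > cutoff]
influential
```

The F-median cutoff is fairly conservative, which is one reason Cook's distance and DFFITS can disagree about how many observations to flag.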

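However, DFFITS flags many observations as influential. The output below lists the row index and DFFITS value of every observation whose absolute DFFITS exceeds the cutoff \(2\sqrt{p/n}\), which is about 0.055 here.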

# DFFITS (Difference in Fits)
DFFITS <- dffits(result2)
z <- (DFFITS[abs(DFFITS)>2*sqrt(p/n)])
z
##       18183       20728       18175       13284       19485       13973 
##  0.07108293  0.06319248  0.05902664 -0.05717550  0.10840119 -0.08076250 
##        8547        5936         719        5136        9086       14837 
## -0.09930367 -0.10606113  0.06050325 -0.05954569 -0.14109147 -0.07918745 
##       12669        5388        4590        3616       12339        2137 
## -0.07959627 -0.10368078  0.05747444 -0.08701908 -0.05614251  0.05791141 
##       10141        8706       10994        3760       11014         116 
## -0.05546738  0.05800459 -0.06496679 -0.06195006  0.07061236 -0.08237987 
##         313        3157       15135       10841       14276       11447 
##  0.08174423  0.08015213 -0.16377244  0.07990202  0.12474943  0.10993350 
##       17871        7379        3363       18004       11600       18520 
## -0.05853706 -0.06872001  0.06765316  0.05800427 -0.05989450  0.05573200 
##       12277       20594        2506       18334        4777        2267 
##  0.06004510  0.07440618 -0.06879759 -0.07068597 -0.06192556 -0.05904749 
##         217        7987        1654        9885        4630       16940 
##  0.06321358  0.08574407 -0.10550317 -0.06103064  0.06503921 -0.05673738 
##       12552       19149       11471       16889        4475        3375 
## -0.05940134  0.11966224  0.05517153 -0.06736042 -0.06261089  0.05686820 
##       15071        3158        4527        4612        6963       17143 
## -0.07444605 -0.07593816  0.06007790 -0.10863439  0.05908704  0.07940221 
##       10386        2182       13816        9453        4086       18595 
## -0.08426683  0.07208450 -0.06449253  0.06522075  0.06137966 -0.12722131 
##       17349       19650       14451        1362       18793       19150 
## -0.06060542  0.05724713  0.05644792  0.06068046 -0.06347476  0.05484574 
##       20248       18333        3915        2076       18428       19826 
##  0.06919517 -0.22295939  0.14523765  0.14880907 -0.07313483 -0.05703511 
##        3229       11559       12827        3768        6305       17658 
## -0.09522491  0.05496718  0.10291240 -0.05711788 -0.05586900 -0.14615723 
##       14514       12711       14424        2985        1865        8093 
##  0.10714687 -0.05502144 -0.11554095 -0.05557590 -0.06105383 -0.11803420 
##       16908        2445        5082        8569       13852         283 
## -0.06369343  0.06172163  0.05934065 -0.09432188  0.08413628  0.06229435 
##       12986       16776        3439       12105       13236       20093 
## -0.09410781  0.05624977  0.07046594  0.06213938  0.07633140  0.06305631 
##       16378        4550        3726        4442       19298         301 
##  0.09433068 -0.09545193 -0.06862263  0.06305899  0.05528995  0.15349075 
##       20175        4766       17198       14386        6897       12187 
##  0.06116196  0.06429567 -0.16915868  0.17323899 -0.06745552  0.06839312 
##       19382       16270         839       17900       10959       17152 
##  0.07140647  0.09383761 -0.05652887  0.08598118  0.09773307  0.14946613 
##         312       12371       18507        7990       10505       10662 
## -0.07336048  0.08499062 -0.06476957  0.16120651  0.07064676 -0.05598395 
##        5748       13710       18896        2900       17153       21384 
## -0.10499340 -0.09135252 -0.07150133  0.11815055 -0.13409275  0.05926969 
##        5828        5257       15669       19472         886       15765 
##  0.09149204 -0.06091287 -0.17495360  0.07143248  0.11123039  0.05952294 
##       11536       11258       13968       15431       21432       18355 
##  0.12620039  0.08724044  0.14716609  0.05713714  0.13908702  0.05583365 
##       19157       20387        3224         359       13985        4150 
## -0.10757396  0.09532979  0.06818043  0.11549926  0.06617025  0.12054932 
##       10981       14222        7295         458       17493        2032 
## -0.10667159  0.05560285  0.17822087 -0.12867810 -0.09052462 -0.13154013 
##       14191       16681       18427        8540        1295       20825 
##  0.08026588  0.05782038  0.05612450  0.06604821 -0.08799128  0.07388799 
##         657        4697       12588       17529       18469        5304 
##  0.29439907 -0.06247325  0.05618829 -0.11216622 -0.11493031 -0.06632014 
##        5792       19481       12725       17549       15163       19453 
##  0.06249082  0.07054349 -0.07419790 -0.05523865  0.06383186  0.18368305 
##       19175         997        9213       20155       15159       20940 
## -0.06949802 -0.05526628 -0.05636474  0.18376486  0.06216365  0.05913841 
##       20185       17660       17364       13436       16845        6824 
##  0.05601295  0.05552905  0.07698730 -0.06277547 -0.07125624  0.08179609 
##        2224         247        3043       17018        7424       16345 
##  0.09145185  0.26185334  0.06670999 -0.06458052  0.05983266 -0.05524523 
##        6995       21326        6892       13831       18913       11285 
## -0.07887286  0.10973710  0.07532461  0.06101163  0.08882648  0.05680292 
##       19138       18686       13478        3541        4082       12152 
## -0.07207382  0.08964322  0.09463365 -0.06085206  0.05802625  0.06604497 
##        8029        9369       18587       18330       13021        9642 
##  0.09324214  0.07663521 -0.09396192 -0.06996668  0.10007375  0.08108778 
##       11879        1031       12697       18288       14775        2558 
##  0.08826250  0.06234696  0.07032536  0.07762940 -0.17375180 -0.07400285 
##       12908       17586       14964         519        4923        5789 
## -0.07015386  0.06756869  0.05706398  0.07515092  0.08578459 -0.05544796 
##       15697        6198         231        6234          70       18022 
## -0.07067144  0.06389339 -0.13447681 -0.08730313  0.07522748  0.10218588 
##        4191       15328        9622       16536        6809       12150 
##  0.05845354 -0.06122496 -0.06964431 -0.09010370  0.08078622  0.06803387 
##       20225       20756       10254         541       16253       11832 
##  0.06716845  0.07734146 -0.06630145 -0.17382420  0.11061058 -0.10306714 
##        3205       17745        1221        8917       17597           3 
## -0.12713790 -0.07132727 -0.06612544  0.06411035  0.10183792 -0.06041297 
##        6741       16393         415        3934       18456       20097 
##  0.09250009  0.08849357 -0.05994076  0.06082247  0.10607392  0.09239186 
##        2590       20895        4930       13630        4652        8639 
## -0.14969553  0.07050700 -0.05503907  0.09459262 -0.05945616  0.16355438 
##        5178        5377       20934         882        4219         770 
##  0.05639895 -0.09218005  0.07791587 -0.06188406 -0.08780400 -0.05972490 
##       10586        6046        8750        2154        3885        5852 
## -0.07195325  0.06241155 -0.05910855  0.05722436 -0.05496736  0.06114941 
##        1787       14071        9609       14295       18867       15214 
##  0.06988851  0.05976932  0.09815374 -0.07514992  0.07599147  0.05489008 
##        9344        1311       21345       16179       21369       18607 
##  0.07304106 -0.07974223 -0.18024150 -0.07099211  0.08573644  0.10147200 
##       18203       13044        5367       12427       14329       16005 
##  0.05742608 -0.05759074 -0.05711293 -0.06094474  0.06175693  0.06851916 
##       15495        4924        2618       18002        6403        7232 
## -0.07743400  0.08844843 -0.05525030 -0.06487375  0.10954667  0.07041252 
##        4860          66       10767        4412       19469       12125 
##  0.05900183 -0.06856526  0.07032302  0.12947835  0.05498374  0.11153748 
##       16521        9078        7846       19957        1326        7537 
##  0.10587766  0.06323873 -0.07747703  0.07155702 -0.05568308  0.06255010 
##       13378       18795        5669        3764       11704        3230 
##  0.05822239  0.05942163 -0.06124087  0.05863350 -0.06386975  0.07751080 
##        7934       10361        4241        5881       16707       18276 
##  0.07638054 -0.11712218  0.06437378  0.11281363 -0.06437941 -0.16233349 
##       17954        1434       14804        3537        8708       20021 
## -0.05593713  0.05670237  0.05646210  0.13035223  0.05829097  0.05559131 
##       18227       20624        4025         126       10111       15522 
##  0.11689148  0.05821774 -0.33887182  0.08000492  0.06184661  0.05982442 
##       20875        9460        3951       20963        8618        7783 
## -0.05502156 -0.07772365 -0.07450546 -0.09221637 -0.06452741  0.07079328 
##       18513        9295       12568        6458       15841       18096 
##  0.06824290  0.05624665  0.05515196  0.07072592  0.11165868  0.05589893 
##        6953       11875       10319        4872       17475        4706 
## -0.08437976  0.11635788 -0.08086892  0.05690086 -0.25297126  0.05694014 
##       19668        8050       18380        3872        2944         295 
##  0.05792198  0.13649415  0.19754956  0.08016384 -0.06402981 -0.05522868 
##       14472       14242       20042        2714       12647         752 
## -0.10780025 -0.08271626  0.06914056 -0.20548676  0.09352166  0.05727542 
##        4181       14737       21352        4763       17845       15247 
## -0.06258015 -0.06160901  0.12384079  0.11344981  0.07766217  0.09460210 
##        9926       18761       13788       18528         240        5025 
##  0.06181817 -0.06140221  0.06118910 -0.06251784 -0.08919034 -0.06539265 
##       15023        1397        8830       18989       13966       10892 
##  0.06803801 -0.06003608 -0.07737612 -0.06980278 -0.05644983 -0.09757097 
##        1809       10264         760       17402       15633       17577 
##  0.07733518  0.19057000 -0.08073097 -0.09198329  0.07977387  0.10234049 
##       19324       19685        1883         159        2896       19782 
##  0.06819736  0.08500270  0.07036221 -0.08069348 -0.08267836 -0.06946275 
##       11402        4363        8164       15040        7416       18705 
##  0.05898108  0.06744806  0.08470630  0.19437678  0.08842645  0.06036455 
##        3528       13041        5774        6826       11872        9157 
## -0.06010001  0.07213705 -0.05988558 -0.06740608 -0.09548012 -0.05768331 
##        9557        8320        5381       17450       16571        3587 
##  0.07612202  0.08972642  0.06099323  0.05648136 -0.05498992 -0.08332844 
##       14648        7887        5571       19467       19962       15618 
##  0.06076670 -0.07720355 -0.05617430  0.09563764  0.07903091  0.13755162 
##        2383        5673        7098       13811       16016        8785 
## -0.07091827 -0.08442465 -0.13177439 -0.06580293  0.05735440 -0.07271719 
##       13967        6769       13826        7123       18024        7270 
## -0.21007995 -0.05546427 -0.05994089  0.05666347  0.05640313 -0.05664152 
##       14582        1200         351        3253        5590       13257 
## -0.09437688 -0.06844834  0.07804958 -0.14603333  0.07617241  0.07484506 
##       18070       14921        9323        6515       20536         270 
##  0.07505360  0.08647714  0.05978756  0.06111724  0.09779741  0.10521380 
##       20008       16039        4812       13629       10284       16715 
##  0.07277246 -0.07018999 -0.08879640 -0.08706015 -0.06107481 -0.05746634 
##       14840        7607       14856       18780       15869        3109 
## -0.10241042  0.08654399 -0.05859315 -0.06341514 -0.07804742 -0.05699177 
##        8915       19913       16185       12940         420       13663 
## -0.08732471  0.06632595  0.11013388 -0.05626900 -0.07580941 -0.06525244 
##        7121       20370       12933        6614       13400         800 
## -0.09380302  0.07209740  0.08635548 -0.05533993  0.10794918 -0.05771530 
##       18589       19087        3379        7136        4524       11741 
##  0.09450220  0.05726069 -0.06376454  0.06704282  0.06685207  0.05732004 
##        1449        8665       15011       17350         444        2430 
##  0.12186388  0.07387338 -0.05571048  0.06983920  0.07733287  0.05669681 
##        8890        2564        6524        2304        3976        8058 
## -0.05909834  0.06717681 -0.05963004 -0.09045328 -0.05650660  0.05628600 
##       12337       20423        1244       10648        8444       11279 
##  0.05859208  0.05743791 -0.06525520  0.07298027  0.08636107  0.13728392 
##        6533        1437        7959        3278        6378        8817 
## -0.05740563 -0.06133080 -0.06976886  0.07068839  0.09847261  0.05952260 
##        8856       15945       19455       18803       18200        1957 
##  0.08657320  0.09209136 -0.06219256 -0.10145270  0.06869093  0.06047496 
##         466       13154       10920        6784        5720        1932 
## -0.09908183  0.06985926  0.05772840  0.06526192 -0.06446319 -0.08235971 
##        2086       18656        8642       19098       18646        5450 
##  0.05676922  0.05925669 -0.08508716 -0.10149944  0.08545311  0.09809084 
##        1808       20297       21103       13606        9033        3778 
## -0.07794957 -0.07737086 -0.07757591  0.13677372 -0.07564644  0.05722280 
##        3040       15145       13673        6832       15238        4049 
##  0.08309355 -0.06295635  0.06546142 -0.05598540  0.07460585 -0.07355320 
##       20579        1735       15021       17307       17950        7531 
## -0.07194348 -0.07100500  0.05887795  0.09257055 -0.13727144  0.07517333 
##        3583        3862        5600        7834       16289        8478 
##  0.07877095  0.06002156 -0.07312489  0.05900212  0.12097651 -0.05811937 
##       17115       10964       19987       15483       17570       11333 
## -0.06823092 -0.13255425  0.07622841 -0.09102489  0.06281269  0.05684365 
##       12649        3440       18293        1386        4769       10771 
##  0.11044821  0.06198071 -0.09749187 -0.05716780 -0.12571127 -0.05548057 
##        2787         876        1850        4564        9173        2865 
##  0.06150604  0.16182841 -0.07198668 -0.06535846  0.06076399  0.11058515 
##        2798         498       18536       13965       14140       11106 
##  0.05572731 -0.07741593 -0.06305540  0.06039245  0.08796435 -0.05777285 
##       17408       20326       17768        8161       10470       10447 
##  0.05592642  0.09141891  0.13726908 -0.05688934  0.05971804  0.09951650 
##        8979       19418        7370        3952       10421       12510 
##  0.07139133 -0.09288807 -0.11166059 -0.07450546 -0.06859208  0.10060053 
##       13149        6692        5562       18006       12778       18512 
##  0.08848013  0.27256779 -0.06141728  0.12487391 -0.72460199  0.08421228 
##       20564        2047       16427       20253        3719        4582 
## -0.06746613  0.07409474 -0.05691316  0.07615391  0.05520506 -0.06155510 
##       21324        2041       19726        9970       17876        1573 
##  0.06592488  0.06289881  0.07548353  0.06159988  0.05629513  0.09159554 
##       13543        8386       21142        3259       12819       12714 
##  0.05944797  0.05806847  0.09361514  0.07777520  0.06924573  0.08797078 
##        5833       18849        2142       14551        5064       15377 
##  0.13276578 -0.35866595 -0.07143585  0.14226512 -0.05587090  0.06468140 
##       17903         892        8130       19673       18976       13072 
##  0.06526040  0.09096267 -0.05980854 -0.06401717 -0.06468962  0.09935068 
##        7320        8717       12232       14828       17810       11953 
##  0.08369567  0.09243602 -0.06600442 -0.15894459 -0.06812206  0.08972838 
##        9106       13488       19260        2474       12614        3092 
##  0.06778837 -0.06133681 -0.09868033  0.07355447  0.11958108  0.05709639 
##       19117       15775       13442        1731        5119       12906 
## -0.07507886 -0.09870355  0.09714945  0.06245940 -0.05996527  0.06241194 
##       16804        8388       16414        8223        7069        7253 
## -0.07986587  0.09273750 -0.06946000  0.07765127  0.09035226 -0.19148554 
##       13549       14783        1880        9715       18209        3185 
##  0.07408928 -0.06249606  0.08381725 -0.07617292  0.12827603 -0.06599895 
##        5139        2075       17252        7429        7251       17083 
##  0.06488671 -0.09973704 -0.05540390  0.05660453  0.06052728  0.08680303 
##       14900        7452        2307       16843        6013       19732 
## -0.07452334  0.06002563 -0.09115086 -0.10731967  0.06700331 -0.11145337 
##        5640        5776        2412       15871        5551       14048 
## -0.05972987  0.06252856 -0.13864874  1.28527891 -0.07021696 -0.08980036 
##        1755       10476       20145        8531       13261       20371 
## -0.07089967  0.05753903  0.10968522 -0.08342586  0.05898900  0.10081856 
##       19824        9722       17957        9302        8326       17572 
##  0.08096301  0.07985825  0.06187136  0.05601557 -0.07096757 -0.12385910 
##       10469       10428        1485       14255        7701       13295 
##  0.10106618  0.09431438  0.05521356  0.11258296 -0.05646027  0.08761202 
##       18629       18394       17531        5191        4541       15588 
##  0.06481850  0.08673979  0.06913950  0.06145712  0.05906316 -0.07830251 
##       12936       16855        4602       17772        3238        9555 
## -0.12433825  0.07343956  0.05704021 -0.05808546 -0.06296861 -0.06012670 
##        3264        3672        1161       21373        8475       14056 
##  0.05673810  0.06050271 -0.07392745 -0.10941725 -0.05493448  0.06789239 
##        1032       10843       15139        1264        1531        7786 
##  0.08428998  0.07468501 -0.05821672 -0.05537226 -0.05827980 -0.06210923 
##        7996        6080       21149       13526       12684         417 
## -0.06751234  0.06842547  0.06490269 -0.12996782  0.06269976  0.06762985 
##       10326       13608        3387       14853       17290        5968 
##  0.05751493 -0.06866894  0.06807383 -0.06656989  0.07553078 -0.05842884 
##        5030        3806       13908        6464       14216       10170 
## -0.06593869 -0.08542474 -0.11040141 -0.05827068  0.05817783 -0.07371833 
##       10828        4273       15416         682        4203        9275 
## -0.05560873  0.05710634  0.17646129  0.07809119  0.07736733 -0.07218134 
##       21358       11703       14572       18273       15378       11774 
##  0.05530573 -0.10465324 -0.06404848  0.07908051  0.22437070  0.05616608 
##        3403       11797       14557       19623       15032        1623 
## -0.06702747 -0.05721405 -0.18730005  0.11858310 -0.14273870 -0.09195913 
##        6393       15007       13771        2150        4340        6103 
## -0.07502336 -0.09509251 -0.05954666 -0.07718390  0.10053608 -0.07772130 
##         786       10868       19337        5762       10526       15955 
## -0.05820930  0.05693276  0.05506844 -0.05589593  0.06949036  0.11308467 
##        7847       14188        4610        2846        9074       15693 
## -0.07913104 -0.12528360 -0.05860391  0.08752495 -0.11766781 -0.25862054 
##       14902       14570       12980       15294        1927        9324 
##  0.10451500  0.05572357 -0.05501150 -0.10136499 -0.05957692  0.08843227 
##       12886        4913        7433       15516       15411       14367 
##  0.07673982 -0.08827226  0.11927691 -0.05910905  0.11566163 -0.06974006 
##       11603        5413       14798        1273       17708        5609 
##  0.05649084 -0.06172797  0.06466677  0.09728207 -0.05488721  0.06985099 
##       10070        1472       13117        5593       11709        8997 
## -0.06877214  0.06395512 -0.06465741  0.07087921  0.05513587  0.05905700 
##        5163       13199        1257       20501       11254       17102 
##  0.06367470  0.05989496 -0.05669088  0.08786917 -0.06998760  0.05724316 
##       11160       17860       11299        7541       19889         591 
## -0.11092348 -0.12401331 -0.06124001 -0.05539894  0.05569711  0.06658016 
##       12209        4388        6380       18965       21072        8278 
## -0.08357803 -0.08164945 -0.06544323  0.06270183  0.05673104 -0.28920841 
##         428       11730       10620        1938        9857       11634 
##  0.07774010 -0.11938563 -0.06587556  0.06502878 -0.10297896  0.08466670 
##        6155       18794       15634       14349       14148       19105 
## -0.07909587  0.06805622 -0.05909267 -0.06069094  0.06322879 -0.15998128 
##       11218       18305        6399         485        8345       20898 
##  0.06241337  0.06394627  0.07013151  0.06053187 -0.12046489  0.05566937 
##        7281       17931       16107        3802       17839        5026 
## -0.06700666  0.09049822  0.07556182  0.07542499  0.06520629  0.06558673 
##       16082        1283        3353        5434       13313       12135 
##  0.05646463  0.12498455  0.06505477  0.08663073 -0.14970723  0.11597006 
##        4409        2059       19092       16341       14323        9450 
##  0.08903873 -0.06592801 -0.06876561 -0.06030106  0.06405057  0.07364000 
##       12397       19530       14099        5321        4135        6907 
##  0.05722777  0.08109214  0.11630555 -0.06332987 -0.06093333 -0.08514851 
##       21569        5990       13917       19189       16779        9199 
##  0.06172140  0.05845615  0.06268721 -0.19720352 -0.06672879 -0.11446217 
##       20453       13559       13676       18228       15888        8832 
##  0.26654896 -0.05569079  0.08544398  0.05900581  0.09314487  0.09014571 
##       11891       15893        9795        8607        2688       14053 
##  0.06917554  0.11098310  0.10220817  0.08426558  0.05635489  0.10657598 
##       12482       12288       11955       17716       11933        6550 
## -0.05979510  0.05756861 -0.05643967  0.12444524 -0.07415478 -0.06889916 
##        6036       10374        3169       10983       18483        9778 
## -0.05886720 -0.08136845 -0.05983283  0.06431959  0.07977601 -0.06220027 
##        7314       18478        5865       16970        2882        3282 
##  0.10374048 -0.08871755  0.06495482  0.07176723 -0.06846825  0.07337423 
##        1585        5851       19523       19778        9851       15298 
##  0.07436225 -0.07927791 -0.10897348  0.06621383  0.06468412  0.06366740 
##       12019        8674        5618       20951       12058       11366 
##  0.06954498  0.07774463  0.09332332  0.05867510  0.06375438 -0.06054659 
##       14012       10860       13058        8538       17797       13622 
## -0.05632986 -0.06141979 -0.06341953  0.06262987 -0.05869278  0.05965714 
##        1627       19386       13628       10023       18555        4792 
## -0.09529229 -0.08103094 -0.07020420 -0.07811609  0.07054976  0.07313937 
##        9548       14985       15329        7993        6433        3688 
##  0.09940457 -0.08067415  0.07834550 -0.09365170 -0.07091925  0.08574485 
##       10980       12115        3464        4036        2409        3955 
##  0.06749533 -0.09788558  0.08884797  0.19367404  0.08816436 -0.06680181 
##       13942       17387        8537       11785        8446       11054 
## -0.08900517  0.10213393  0.08802105 -0.07799622  0.10910560 -0.07087861 
##       19985       18344       21051       19018       19462        8645 
##  0.09831118 -0.07291265 -0.27518356  0.06615561 -0.11791368  0.05913601 
##        8249        7912       15752       16996       18877        4938 
## -0.07664036  0.11737763 -0.12766052 -0.07213014  0.15035517  0.08575936 
##        9412       10192        7518        3322 
##  0.06600336 -0.07604076 -0.07537319 -0.06403156

Now that we have identified the influential points, let’s remove them from our training data.

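indices <- rownames(data.frame(z))  # row names of the influential observations

train2 <- train
train2$indices <- rownames(train2)  # keep the original row names in a column

# Drop every influential row in one vectorized filter instead of a loop.
# .data$indices is the column in train2; .env$indices is the vector above.
train2 <- train2 %>% filter(!(.data$indices %in% .env$indices))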

nrow(train2)
## [1] 16362
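
The flag-and-remove step can be sketched on a small built-in data set. This is only an illustration: it uses mtcars rather than the King County data, and it assumes that the influential points were flagged with DFFITS and the rule-of-thumb cutoff 2 * sqrt(p / n), which is not confirmed in this part of the report. The names fit, cutoff, flagged and cleaned are hypothetical.

```r
# Toy illustration on the built-in mtcars data (not the King County data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))       # number of estimated parameters, incl. intercept
cutoff <- 2 * sqrt(p / n)    # assumed rule-of-thumb DFFITS threshold

d <- dffits(fit)             # one DFFITS value per observation, in row order
flagged <- rownames(mtcars)[abs(d) > cutoff]

# Drop the flagged rows by row name in a single step.
cleaned <- mtcars[!(rownames(mtcars) %in% flagged), ]

nrow(cleaned)
```

Filtering on row names with %in% removes all flagged rows at once, so no loop over individual indices is needed.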

After deleting outliers, high leverage points, and influential points, the performance of our model has improved.

Judging by the diagnostic plots below, our new model also meets the regression assumptions.

result3 <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living + 
               sqft_living15 + sqft_lot15 + renovated, data = train2)
summary(result3)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     view + condition + grade + yr_built + sqft_living + sqft_living15 + 
##     sqft_lot15 + renovated, data = train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36488 -0.07307  0.00445  0.07407  0.37689 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.918e+00  9.293e-03 314.005  < 2e-16 ***
## bedrooms      -1.220e-02  1.199e-03 -10.169  < 2e-16 ***
## bathrooms      2.796e-02  1.647e-03  16.975  < 2e-16 ***
## floors         4.394e-02  1.872e-03  23.472  < 2e-16 ***
## waterfront     1.508e-01  1.470e-02  10.260  < 2e-16 ***
## view           1.966e-02  1.324e-03  14.847  < 2e-16 ***
## condition      1.784e-02  1.376e-03  12.969  < 2e-16 ***
## grade          7.446e-02  1.279e-03  58.220  < 2e-16 ***
## yr_built      -4.079e-02  7.413e-04 -55.021  < 2e-16 ***
## sqft_living    5.526e-05  2.094e-06  26.394  < 2e-16 ***
## sqft_living15  3.414e-05  2.099e-06  16.264  < 2e-16 ***
## sqft_lot15    -2.928e-07  3.788e-08  -7.731 1.13e-14 ***
## renovated      1.110e-02  4.850e-03   2.289   0.0221 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1033 on 16349 degrees of freedom
## Multiple R-squared:  0.6727, Adjusted R-squared:  0.6725 
## F-statistic:  2801 on 12 and 16349 DF,  p-value: < 2.2e-16
test$predict <- round(predict(result3, newdata = test) ^ 10)
test_mse_ln_3 <- mean((test$price - test$predict)^2)
test_mse_ln_3
## [1] 44104007128
summary(result3)$r.squared
## [1] 0.6727283
summary(result3)$adj.r.squared
## [1] 0.6724881
PRESS(result3)
## [1] 174.7996
##Find SST 
anova_result<-anova(result3) 
SST <- sum(anova_result$"Sum Sq") 

##R2 pred 
Rsq_pred <- 1-PRESS(result3)/SST 
Rsq_pred
## [1] 0.6723251
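
For reference, the PRESS statistic and the predicted R-squared used above can be sketched as follows. The PRESS() function called in this report is defined elsewhere, so press_stat below is a hypothetical stand-in that relies on the standard leave-one-out identity e_i / (1 - h_ii); the example runs on the built-in mtcars data.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# PRESS: sum of squared leave-one-out prediction errors.
# Each one equals the ordinary residual divided by (1 - leverage).
press_stat <- function(model) {
  sum((residuals(model) / (1 - hatvalues(model)))^2)
}

sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
rsq_pred <- 1 - press_stat(fit) / sst          # predicted R-squared

rsq_pred
```

Because PRESS is never smaller than the residual sum of squares, the predicted R-squared is always below the ordinary R-squared; a small gap between the two suggests little overfitting.
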
yhat<-result3$fitted.values 
res<-result3$residuals
Data<-data.frame(train2,yhat,res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")

vif(result3)
##      bedrooms     bathrooms        floors    waterfront          view 
##      1.683463      2.093926      1.560657      1.109195      1.231753 
##     condition         grade      yr_built   sqft_living sqft_living15 
##      1.228025      3.110906      1.808906      4.792639      2.929764 
##    sqft_lot15     renovated 
##      1.074602      1.100928

To address the negative coefficient on the sqft_lot15 predictor, let’s apply a log transformation to sqft_lot15.

ggplot(train2, aes(x = sqft_lot15, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sqft Lot15", y = "Price", title = "A Scatterplot of Sqft Lot 15 vs Price")
## `geom_smooth()` using formula 'y ~ x'

ggplot(train2, aes(x = log(sqft_lot15), y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "log(Sqft Lot15)", y = "Price", title = "A Scatterplot of log(Sqft Lot 15) vs Price")
## `geom_smooth()` using formula 'y ~ x'

Even after log-transforming sqft_lot15, its coefficient is still negative. However, the overall performance of the model is slightly better than that of the previous one: R-squared rises from 0.6727 to 0.6816 and the test MSE falls from 44104007128 to 43387370710.
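
The test MSE used for this comparison is computed on the original price scale: predictions are raised to the 10th power before being compared with test$price. This suggests that the response was modeled as price^(1/10), which is an assumption here because the transformation step lies outside this section. A minimal sketch of the same back-transformation on the built-in mtcars data, with hypothetical names toy_train, toy_test and test_mse:

```r
set.seed(1)
idx <- sample.int(nrow(mtcars), floor(0.8 * nrow(mtcars)))
toy_train <- mtcars[idx, ]
toy_test  <- mtcars[-idx, ]

# Fit on the transformed response y^(1/10) (assumed transformation).
fit <- lm(I(mpg^0.1) ~ wt + hp, data = toy_train)

# Undo the transformation before measuring error on the original scale.
pred <- predict(fit, newdata = toy_test)^10
test_mse <- mean((toy_test$mpg - pred)^2)

test_mse
```

Measuring the error after back-transforming keeps the MSE values of different models comparable, whatever scale each model was fitted on.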

result4 <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living + 
               sqft_living15 + log(sqft_lot15) + renovated, data = train2)
summary(result4)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     view + condition + grade + yr_built + sqft_living + sqft_living15 + 
##     log(sqft_lot15) + renovated, data = train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35417 -0.07130  0.00412  0.07211  0.38181 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.160e+00  1.418e-02 222.882  < 2e-16 ***
## bedrooms        -1.155e-02  1.178e-03  -9.799  < 2e-16 ***
## bathrooms        2.370e-02  1.637e-03  14.478  < 2e-16 ***
## floors           2.861e-02  1.980e-03  14.448  < 2e-16 ***
## waterfront       1.684e-01  1.452e-02  11.599  < 2e-16 ***
## view             1.838e-02  1.307e-03  14.065  < 2e-16 ***
## condition        1.931e-02  1.359e-03  14.214  < 2e-16 ***
## grade            7.324e-02  1.262e-03  58.029  < 2e-16 ***
## yr_built        -3.961e-02  7.329e-04 -54.039  < 2e-16 ***
## sqft_living      6.243e-05  2.089e-06  29.882  < 2e-16 ***
## sqft_living15    4.404e-05  2.122e-06  20.754  < 2e-16 ***
## log(sqft_lot15) -2.787e-02  1.228e-03 -22.698  < 2e-16 ***
## renovated        1.568e-02  4.788e-03   3.275  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1019 on 16349 degrees of freedom
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6813 
## F-statistic:  2916 on 12 and 16349 DF,  p-value: < 2.2e-16
test$predict <- round(predict(result4, newdata = test) ^ 10)
test_mse_ln_4 <- mean((test$price - test$predict)^2)
test_mse_ln_4
## [1] 43387370710
summary(result4)$r.squared
## [1] 0.6815662
summary(result4)$adj.r.squared
## [1] 0.6813325
PRESS(result4)
## [1] 170.087
##Find SST 
anova_result<-anova(result4) 
SST <- sum(anova_result$"Sum Sq") 

##R2 pred 
Rsq_pred <- 1-PRESS(result4)/SST 
Rsq_pred
## [1] 0.6811591
yhat<-result4$fitted.values 
res<-result4$residuals
Data<-data.frame(train2,yhat,res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")

vif(result4)
##        bedrooms       bathrooms          floors      waterfront            view 
##        1.670024        2.125405        1.794334        1.112768        1.234111 
##       condition           grade        yr_built     sqft_living   sqft_living15 
##        1.231155        3.113223        1.817342        4.904594        3.076865 
## log(sqft_lot15)       renovated 
##        1.462865        1.102725

To boost the predictive performance of our model, let’s take full advantage of the zipcode predictor that we dropped at the beginning. Our dataset contains a total of 70 distinct King County zipcodes.

house2 <- read.csv('kc_house_data.csv')
house2 <- house2 %>% dplyr::select(-id, -date, -lat, -long)

zipcode <- unique(house2$zipcode)
zipcode
##  [1] 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 98007 98115
## [13] 98107 98126 98019 98103 98002 98133 98040 98092 98030 98119 98112 98052
## [25] 98027 98117 98058 98001 98056 98166 98023 98070 98148 98105 98042 98008
## [37] 98059 98122 98144 98004 98005 98034 98075 98116 98010 98118 98199 98032
## [49] 98045 98102 98077 98108 98168 98177 98065 98029 98006 98109 98022 98033
## [61] 98155 98024 98011 98031 98106 98072 98188 98014 98055 98039

From www.niche.com, we can obtain the King County zipcodes with an overall grade above A-. We then create a new categorical variable that equals 1 if the zipcode is in that list and 0 otherwise. As the histogram below shows, the two categories are fairly well balanced.

# https://www.niche.com/places-to-live/search/best-zip-codes-to-live/c/king-county-wa/
# Overall Grade > A-
good_zip <- c(98004, 98005, 98052, 98121, 98007, 98109, 98033, 98122, 98029, 98006, 98103, 98102, 98074, 98101, 98040, 98115, 98112, 98107, 98119, 98105, 98075, 98008, 98116, 98053, 98034, 98039, 98144, 98199, 98117, 98104, 98028, 98027, 98011, 98177, 98125, 98065, 98072, 98077, 98126, 98155, 98136, 98059, 98133, 98188, 98106)

house2$good_neigh <- ifelse(house2$zipcode %in% good_zip, 1, 0)

hist(house2$good_neigh)
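
For a two-level indicator, the class balance can also be read directly from a frequency table. A minimal sketch on toy values, where zip and good are made-up vectors and not the real data or the niche.com list:

```r
zip  <- c(98004, 98178, 98052, 98003, 98004, 98198)  # toy zipcodes
good <- c(98004, 98052)                               # hypothetical "good" list

# 1 if the zipcode is in the list, 0 otherwise
good_neigh <- ifelse(zip %in% good, 1, 0)

table(good_neigh)              # counts per class
prop.table(table(good_neigh))  # share of each class
```

prop.table() turns the counts into proportions, which makes it easy to see how far the split is from 50/50.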

house2$yr_built <- case_when(
  (1900 <= house2$yr_built) &  (house2$yr_built< 1920) ~ 0,
  (1920 <= house2$yr_built) &  (house2$yr_built< 1940) ~ 1,
  (1940 <= house2$yr_built) &  (house2$yr_built< 1960) ~ 2,
  (1960 <= house2$yr_built) &  (house2$yr_built< 1980) ~ 3,
  (1980 <= house2$yr_built) &  (house2$yr_built< 2000) ~ 4,
  (2000 <= house2$yr_built) ~ 5)
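
The case_when() call above bins yr_built into six ordinal 20-year groups, coded 0 to 5. As a sketch of the same idea in base R, findInterval() gives an equivalent coding for years from 1900 onward; yr and breaks are hypothetical names used only for this illustration.

```r
yr <- c(1900, 1919, 1920, 1955, 1999, 2000, 2015)  # toy build years
breaks <- c(1900, 1920, 1940, 1960, 1980, 2000)    # left-closed bin edges

# findInterval() counts how many breaks are <= each year (1..6 here),
# so subtracting 1 gives the 0..5 coding.
yr_group <- findInterval(yr, breaks) - 1L

yr_group
```

One difference to keep in mind: a year before 1900 would become -1 here, whereas case_when() returns NA for any value that matches none of its conditions.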

house2$renovated <- ifelse(house2$yr_renovated != 0, 1, 0)

house2 <- house2 %>% dplyr::select(-zipcode)

house2 <- house2 %>% dplyr::select(-yr_renovated)
head(house2)
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1  221900        3      1.00        1180     5650      1          0    0
## 2  538000        3      2.25        2570     7242      2          0    0
## 3  180000        2      1.00         770    10000      1          0    0
## 4  604000        4      3.00        1960     5000      1          0    0
## 5  510000        3      2.00        1680     8080      1          0    0
## 6 1225000        4      4.50        5420   101930      1          0    0
##   condition grade sqft_above sqft_basement yr_built sqft_living15 sqft_lot15
## 1         3     7       1180             0        2          1340       5650
## 2         3     7       2170           400        2          1690       7639
## 3         3     6        770             0        1          2720       8062
## 4         5     7       1050           910        3          1360       5000
## 5         3     8       1680             0        4          1800       7503
## 6         3    11       3890          1530        5          4760     101930
##   good_neigh renovated
## 1          0         0
## 2          1         1
## 3          1         0
## 4          1         0
## 5          1         0
## 6          1         0
set.seed(1) ##for reproducibility to get the same split
sample<-sample.int(nrow(house2), floor(.80*nrow(house2)), replace = FALSE)
train2 <- house2[sample, ] ##training data frame
test2 <- house2[-sample, ] ##test data frame
head(train2)
##        price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 17401 550000        3      1.75        2910    35200    1.5          0    0
## 4775  275000        4      2.50        2120     6754    2.0          0    0
## 13218 455000        5      2.00        1510     3000    2.0          0    0
## 10539 384950        3      2.50        1860     3690    2.0          0    0
## 8462  140000        2      1.00         900     6400    1.0          0    0
## 4050  925000        3      2.50        2690     7000    2.0          0    0
##       condition grade sqft_above sqft_basement yr_built sqft_living15
## 17401         3     8       2910             0        3          2590
## 4775          3     7       2120             0        4          2120
## 13218         3     6       1510             0        4          1610
## 10539         3     7       1860             0        5          1870
## 8462          2     6        900             0        2          1350
## 4050          5     7       1840           850        2          1800
##       sqft_lot15 good_neigh renovated
## 17401      37500          1         0
## 4775        6937          0         0
## 13218       3600          1         0
## 10539       4394          1         0
## 8462        6405          0         0
## 4050        6435          1         0

As we did at the beginning, let’s use an automated search procedure to select predictors.
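
For readers unfamiliar with step(), here is a minimal sketch of the same kind of search on the built-in mtcars data. The names null_fit, full_fit and selected are hypothetical, and trace = 0 is used only to suppress the per-step tables.

```r
null_fit <- lm(mpg ~ 1, data = mtcars)   # intercept-only model
full_fit <- lm(mpg ~ ., data = mtcars)   # model with every predictor

# Start from the null model and add or drop one term at a time,
# keeping a move only when it lowers the AIC.
selected <- step(null_fit,
                 scope = list(lower = formula(null_fit),
                              upper = formula(full_fit)),
                 direction = "both", trace = 0)

formula(selected)
```

With direction = "both", a predictor added in an early step can still be dropped later if it becomes redundant once other predictors are in the model.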

regnull <- lm(price ~ 1, data = train2)
regfull <- lm(price ~ ., data = train2)
step(regnull, scope = list(lower = regnull, upper = regfull), direction = "both")
## Start:  AIC=442775.7
## price ~ 1
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living    1 1.1329e+15 1.1553e+15 430962
## + grade          1 1.0242e+15 1.2640e+15 432516
## + sqft_above     1 8.4127e+14 1.4469e+15 434853
## + sqft_living15  1 7.8905e+14 1.4992e+15 435466
## + bathrooms      1 6.2379e+14 1.6644e+15 437274
## + good_neigh     1 3.7528e+14 1.9129e+15 439680
## + view           1 3.5639e+14 1.9318e+15 439850
## + sqft_basement  1 2.3919e+14 2.0490e+15 440869
## + bedrooms       1 2.0920e+14 2.0790e+15 441120
## + waterfront     1 1.7097e+14 2.1172e+15 441435
## + floors         1 1.6137e+14 2.1268e+15 441513
## + renovated      1 3.5656e+13 2.2525e+15 442506
## + sqft_lot       1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15     1 1.5906e+13 2.2723e+15 442657
## + yr_built       1 5.6435e+12 2.2826e+15 442735
## + condition      1 3.4753e+12 2.2847e+15 442751
## <none>                        2.2882e+15 442776
## 
## Step:  AIC=430962
## price ~ sqft_living
## 
##                 Df  Sum of Sq        RSS    AIC
## + good_neigh     1 2.2094e+14 9.3438e+14 427294
## + view           1 9.6282e+13 1.0590e+15 429459
## + grade          1 9.6101e+13 1.0592e+15 429462
## + waterfront     1 9.0018e+13 1.0653e+15 429561
## + yr_built       1 6.8189e+13 1.0871e+15 429912
## + bedrooms       1 3.3062e+13 1.1223e+15 430462
## + renovated      1 1.6775e+13 1.1386e+15 430711
## + sqft_living15  1 1.6529e+13 1.1388e+15 430715
## + condition      1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15     1 6.0106e+12 1.1493e+15 430874
## + sqft_lot       1 3.2768e+12 1.1520e+15 430915
## + sqft_above     1 1.1799e+12 1.1541e+15 430946
## + sqft_basement  1 1.1799e+12 1.1541e+15 430946
## + floors         1 3.1999e+11 1.1550e+15 430959
## + bathrooms      1 2.4923e+11 1.1551e+15 430960
## <none>                        1.1553e+15 430962
## - sqft_living    1 1.1329e+15 2.2882e+15 442776
## 
## Step:  AIC=427294.2
## price ~ sqft_living + good_neigh
## 
##                 Df  Sum of Sq        RSS    AIC
## + waterfront     1 1.0049e+14 8.3390e+14 425329
## + view           1 9.3769e+13 8.4061e+14 425468
## + grade          1 4.8715e+13 8.8567e+14 426370
## + yr_built       1 4.1487e+13 8.9289e+14 426511
## + bedrooms       1 2.6106e+13 9.0828e+14 426806
## + renovated      1 1.4132e+13 9.2025e+14 427033
## + condition      1 1.1442e+13 9.2294e+14 427083
## + sqft_living15  1 8.4179e+12 9.2596e+14 427140
## + bathrooms      1 7.2292e+11 9.3366e+14 427283
## + floors         1 5.1669e+11 9.3386e+14 427287
## + sqft_lot       1 1.6274e+11 9.3422e+14 427293
## <none>                        9.3438e+14 427294
## + sqft_lot15     1 9.4700e+09 9.3437e+14 427296
## + sqft_above     1 1.7768e+07 9.3438e+14 427296
## + sqft_basement  1 1.7768e+07 9.3438e+14 427296
## - good_neigh     1 2.2094e+14 1.1553e+15 430962
## - sqft_living    1 9.7854e+14 1.9129e+15 439680
## 
## Step:  AIC=425329
## price ~ sqft_living + good_neigh + waterfront
## 
##                 Df  Sum of Sq        RSS    AIC
## + grade          1 4.6444e+13 7.8745e+14 424340
## + view           1 3.9679e+13 7.9422e+14 424488
## + yr_built       1 3.3563e+13 8.0033e+14 424621
## + bedrooms       1 1.7950e+13 8.1595e+14 424955
## + condition      1 1.0031e+13 8.2386e+14 425122
## + renovated      1 8.3178e+12 8.2558e+14 425158
## + sqft_living15  1 7.8908e+12 8.2600e+14 425167
## + floors         1 3.9438e+11 8.3350e+14 425323
## + bathrooms      1 3.7414e+11 8.3352e+14 425323
## + sqft_lot       1 2.4130e+11 8.3365e+14 425326
## + sqft_above     1 2.3809e+11 8.3366e+14 425326
## + sqft_basement  1 2.3809e+11 8.3366e+14 425326
## <none>                        8.3390e+14 425329
## + sqft_lot15     1 3.0893e+10 8.3386e+14 425330
## - waterfront     1 1.0049e+14 9.3438e+14 427294
## - good_neigh     1 2.3141e+14 1.0653e+15 429561
## - sqft_living    1 8.9762e+14 1.7315e+15 437960
## 
## Step:  AIC=424340.1
## price ~ sqft_living + good_neigh + waterfront + grade
## 
##                 Df  Sum of Sq        RSS    AIC
## + yr_built       1 7.8000e+13 7.0945e+14 422539
## + view           1 3.4827e+13 7.5262e+14 423560
## + condition      1 1.8697e+13 7.6875e+14 423927
## + bedrooms       1 1.0676e+13 7.7678e+14 424106
## + renovated      1 1.0488e+13 7.7696e+14 424110
## + floors         1 7.7556e+12 7.7970e+14 424171
## + bathrooms      1 4.5855e+12 7.8287e+14 424241
## + sqft_above     1 2.8014e+12 7.8465e+14 424281
## + sqft_basement  1 2.8014e+12 7.8465e+14 424281
## + sqft_living15  1 4.2468e+11 7.8703e+14 424333
## + sqft_lot       1 2.9435e+11 7.8716e+14 424336
## <none>                        7.8745e+14 424340
## + sqft_lot15     1 1.4095e+10 7.8744e+14 424342
## - grade          1 4.6444e+13 8.3390e+14 425329
## - waterfront     1 9.8214e+13 8.8567e+14 426370
## - good_neigh     1 1.8343e+14 9.7088e+14 427959
## - sqft_living    1 2.0959e+14 9.9704e+14 428418
## 
## Step:  AIC=422538.6
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built
## 
##                 Df  Sum of Sq        RSS    AIC
## + view           1 1.9437e+13 6.9001e+14 422060
## + bedrooms       1 1.0695e+13 6.9876e+14 422278
## + condition      1 2.3116e+12 7.0714e+14 422484
## + bathrooms      1 1.9274e+12 7.0752e+14 422494
## + renovated      1 1.3410e+12 7.0811e+14 422508
## + sqft_living15  1 9.3984e+11 7.0851e+14 422518
## + floors         1 3.5352e+11 7.0910e+14 422532
## + sqft_lot       1 1.4158e+11 7.0931e+14 422537
## + sqft_above     1 1.1899e+11 7.0933e+14 422538
## + sqft_basement  1 1.1899e+11 7.0933e+14 422538
## <none>                        7.0945e+14 422539
## + sqft_lot15     1 2.3719e+10 7.0943e+14 422540
## - yr_built       1 7.8000e+13 7.8745e+14 424340
## - waterfront     1 8.4681e+13 7.9413e+14 424486
## - grade          1 9.0881e+13 8.0033e+14 424621
## - good_neigh     1 1.3021e+14 8.3966e+14 425450
## - sqft_living    1 1.9636e+14 9.0581e+14 426761
## 
## Step:  AIC=422060.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view
## 
##                 Df  Sum of Sq        RSS    AIC
## + bedrooms       1 8.5389e+12 6.8148e+14 421847
## + condition      1 2.1251e+12 6.8789e+14 422009
## + bathrooms      1 1.7060e+12 6.8831e+14 422020
## + sqft_above     1 1.2459e+12 6.8877e+14 422031
## + sqft_basement  1 1.2459e+12 6.8877e+14 422031
## + renovated      1 1.0939e+12 6.8892e+14 422035
## + floors         1 5.9382e+11 6.8942e+14 422047
## + sqft_living15  1 2.5814e+11 6.8976e+14 422056
## <none>                        6.9001e+14 422060
## + sqft_lot       1 7.8955e+10 6.8994e+14 422060
## + sqft_lot15     1 5.4919e+10 6.8996e+14 422061
## - view           1 1.9437e+13 7.0945e+14 422539
## - waterfront     1 4.6785e+13 7.3680e+14 423193
## - yr_built       1 6.2610e+13 7.5262e+14 423560
## - grade          1 7.9779e+13 7.6979e+14 423950
## - good_neigh     1 1.3270e+14 8.2272e+14 425100
## - sqft_living    1 1.7908e+14 8.6910e+14 426048
## 
## Step:  AIC=421847
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + bathrooms      1 3.8353e+12 6.7764e+14 421751
## + condition      1 2.7023e+12 6.7877e+14 421780
## + renovated      1 1.0256e+12 6.8045e+14 421823
## + sqft_above     1 9.6664e+11 6.8051e+14 421824
## + sqft_basement  1 9.6664e+11 6.8051e+14 421824
## + floors         1 6.3527e+11 6.8084e+14 421833
## + sqft_lot15     1 3.0405e+11 6.8117e+14 421841
## + sqft_living15  1 1.8797e+11 6.8129e+14 421844
## <none>                        6.8148e+14 421847
## + sqft_lot       1 1.7386e+08 6.8148e+14 421849
## - bedrooms       1 8.5389e+12 6.9001e+14 422060
## - view           1 1.7281e+13 6.9876e+14 422278
## - waterfront     1 4.4688e+13 7.2616e+14 422943
## - yr_built       1 6.3278e+13 7.4475e+14 423380
## - grade          1 7.2012e+13 7.5349e+14 423582
## - good_neigh     1 1.3142e+14 8.1290e+14 424894
## - sqft_living    1 1.6973e+14 8.5120e+14 425690
## 
## Step:  AIC=421751.4
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + condition      1 2.5458e+12 6.7509e+14 421688
## + sqft_above     1 1.2960e+12 6.7634e+14 421720
## + sqft_basement  1 1.2960e+12 6.7634e+14 421720
## + renovated      1 5.6770e+11 6.7707e+14 421739
## + sqft_living15  1 4.1997e+11 6.7722e+14 421743
## + sqft_lot15     1 1.8687e+11 6.7745e+14 421749
## + floors         1 1.3107e+11 6.7751e+14 421750
## <none>                        6.7764e+14 421751
## + sqft_lot       1 5.1651e+09 6.7764e+14 421753
## - bathrooms      1 3.8353e+12 6.8148e+14 421847
## - bedrooms       1 1.0668e+13 6.8831e+14 422020
## - view           1 1.6672e+13 6.9431e+14 422170
## - waterfront     1 4.4549e+13 7.2219e+14 422850
## - yr_built       1 6.5320e+13 7.4296e+14 423341
## - grade          1 6.8072e+13 7.4571e+14 423404
## - sqft_living    1 1.2077e+14 7.9841e+14 424585
## - good_neigh     1 1.2765e+14 8.0529e+14 424734
## 
## Step:  AIC=421688.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_above     1 1.7795e+12 6.7332e+14 421645
## + sqft_basement  1 1.7795e+12 6.7332e+14 421645
## + renovated      1 1.0676e+12 6.7403e+14 421663
## + sqft_living15  1 4.5095e+11 6.7464e+14 421679
## + floors         1 3.2529e+11 6.7477e+14 421682
## + sqft_lot15     1 2.0826e+11 6.7489e+14 421685
## <none>                        6.7509e+14 421688
## + sqft_lot       1 4.5832e+09 6.7509e+14 421690
## - condition      1 2.5458e+12 6.7764e+14 421751
## - bathrooms      1 3.6788e+12 6.7877e+14 421780
## - bedrooms       1 1.1226e+13 6.8632e+14 421971
## - view           1 1.6427e+13 6.9152e+14 422102
## - waterfront     1 4.4613e+13 7.1971e+14 422793
## - yr_built       1 5.1992e+13 7.2709e+14 422969
## - grade          1 6.8864e+13 7.4396e+14 423366
## - sqft_living    1 1.2018e+14 7.9528e+14 424519
## - good_neigh     1 1.2824e+14 8.0334e+14 424693
## 
## Step:  AIC=421644.7
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above
## 
##                 Df  Sum of Sq        RSS    AIC
## + renovated      1 9.9938e+11 6.7232e+14 421621
## + sqft_lot15     1 3.0093e+11 6.7301e+14 421639
## + sqft_living15  1 2.2173e+11 6.7309e+14 421641
## <none>                        6.7332e+14 421645
## + floors         1 1.2372e+09 6.7331e+14 421647
## + sqft_lot       1 6.7932e+08 6.7331e+14 421647
## - sqft_above     1 1.7795e+12 6.7509e+14 421688
## - condition      1 3.0293e+12 6.7634e+14 421720
## - bathrooms      1 4.0513e+12 6.7737e+14 421746
## - bedrooms       1 1.1021e+13 6.8434e+14 421923
## - view           1 1.7822e+13 6.9114e+14 422094
## - waterfront     1 4.4101e+13 7.1742e+14 422740
## - sqft_living    1 5.2096e+13 7.2541e+14 422931
## - yr_built       1 5.3763e+13 7.2708e+14 422971
## - grade          1 6.0022e+13 7.3334e+14 423119
## - good_neigh     1 1.2997e+14 8.0329e+14 424694
## 
## Step:  AIC=421621
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot15     1 3.1404e+11 6.7200e+14 421615
## + sqft_living15  1 2.7048e+11 6.7205e+14 421616
## <none>                        6.7232e+14 421621
## + sqft_lot       1 9.6491e+08 6.7231e+14 421623
## + floors         1 1.2630e+08 6.7232e+14 421623
## - renovated      1 9.9938e+11 6.7332e+14 421645
## - sqft_above     1 1.7112e+12 6.7403e+14 421663
## - bathrooms      1 3.4175e+12 6.7573e+14 421707
## - condition      1 3.5334e+12 6.7585e+14 421710
## - bedrooms       1 1.0811e+13 6.8313e+14 421895
## - view           1 1.7581e+13 6.8990e+14 422065
## - waterfront     1 4.3210e+13 7.1553e+14 422696
## - yr_built       1 4.4719e+13 7.1703e+14 422732
## - sqft_living    1 5.2405e+13 7.2472e+14 422917
## - grade          1 5.9879e+13 7.3219e+14 423094
## - good_neigh     1 1.3046e+14 8.0278e+14 424685
## 
## Step:  AIC=421615
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated + 
##     sqft_lot15
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living15  1 3.0923e+11 6.7169e+14 421609
## + sqft_lot       1 2.9174e+11 6.7171e+14 421609
## <none>                        6.7200e+14 421615
## + floors         1 4.0939e+09 6.7200e+14 421617
## - sqft_lot15     1 3.1404e+11 6.7232e+14 421621
## - renovated      1 1.0125e+12 6.7301e+14 421639
## - sqft_above     1 1.8041e+12 6.7381e+14 421659
## - bathrooms      1 3.2882e+12 6.7529e+14 421697
## - condition      1 3.5833e+12 6.7558e+14 421705
## - bedrooms       1 1.1059e+13 6.8306e+14 421895
## - view           1 1.7678e+13 6.8968e+14 422062
## - waterfront     1 4.3153e+13 7.1515e+14 422689
## - yr_built       1 4.4582e+13 7.1658e+14 422724
## - sqft_living    1 5.2705e+13 7.2471e+14 422918
## - grade          1 5.9546e+13 7.3155e+14 423081
## - good_neigh     1 1.2606e+14 7.9806e+14 424586
## 
## Step:  AIC=421609
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated + 
##     sqft_lot15 + sqft_living15
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot       1 3.1649e+11 6.7138e+14 421603
## <none>                        6.7169e+14 421609
## + floors         1 8.0363e+08 6.7169e+14 421611
## - sqft_living15  1 3.0923e+11 6.7200e+14 421615
## - sqft_lot15     1 3.5279e+11 6.7205e+14 421616
## - renovated      1 1.0660e+12 6.7276e+14 421634
## - sqft_above     1 1.5454e+12 6.7324e+14 421647
## - bathrooms      1 3.4379e+12 6.7513e+14 421695
## - condition      1 3.5957e+12 6.7529e+14 421699
## - bedrooms       1 1.1081e+13 6.8277e+14 421890
## - view           1 1.6748e+13 6.8844e+14 422033
## - waterfront     1 4.3361e+13 7.1505e+14 422689
## - yr_built       1 4.4802e+13 7.1649e+14 422723
## - sqft_living    1 4.8255e+13 7.1995e+14 422807
## - grade          1 5.3809e+13 7.2550e+14 422939
## - good_neigh     1 1.2502e+14 7.9672e+14 424558
## 
## Step:  AIC=421602.8
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated + 
##     sqft_lot15 + sqft_living15 + sqft_lot
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        6.7138e+14 421603
## + floors         1 2.4246e+09 6.7137e+14 421605
## - sqft_lot       1 3.1649e+11 6.7169e+14 421609
## - sqft_living15  1 3.3398e+11 6.7171e+14 421609
## - sqft_lot15     1 6.6766e+11 6.7204e+14 421618
## - renovated      1 1.0742e+12 6.7245e+14 421628
## - sqft_above     1 1.4933e+12 6.7287e+14 421639
## - bathrooms      1 3.4356e+12 6.7481e+14 421689
## - condition      1 3.6152e+12 6.7499e+14 421694
## - bedrooms       1 1.0953e+13 6.8233e+14 421881
## - view           1 1.6649e+13 6.8802e+14 422024
## - waterfront     1 4.3541e+13 7.1492e+14 422687
## - yr_built       1 4.4574e+13 7.1595e+14 422712
## - sqft_living    1 4.7994e+13 7.1937e+14 422795
## - grade          1 5.3784e+13 7.2516e+14 422933
## - good_neigh     1 1.2534e+14 7.9671e+14 424560
## 
## Call:
## lm(formula = price ~ sqft_living + good_neigh + waterfront + 
##     grade + yr_built + view + bedrooms + bathrooms + condition + 
##     sqft_above + renovated + sqft_lot15 + sqft_living15 + sqft_lot, 
##     data = train2)
## 
## Coefficients:
##   (Intercept)    sqft_living     good_neigh     waterfront          grade  
##    -5.535e+05      1.603e+02      1.908e+05      6.392e+05      8.721e+04  
##      yr_built           view       bedrooms      bathrooms      condition  
##    -4.823e+04      4.781e+04     -3.451e+04      3.222e+04      2.423e+04  
##    sqft_above      renovated     sqft_lot15  sqft_living15       sqft_lot  
##     2.638e+01      4.198e+04     -3.387e-01      1.067e+01      1.643e-01
step(regnull, scope=list(lower=regnull, upper=regfull), direction="forward")
## Start:  AIC=442775.7
## price ~ 1
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living    1 1.1329e+15 1.1553e+15 430962
## + grade          1 1.0242e+15 1.2640e+15 432516
## + sqft_above     1 8.4127e+14 1.4469e+15 434853
## + sqft_living15  1 7.8905e+14 1.4992e+15 435466
## + bathrooms      1 6.2379e+14 1.6644e+15 437274
## + good_neigh     1 3.7528e+14 1.9129e+15 439680
## + view           1 3.5639e+14 1.9318e+15 439850
## + sqft_basement  1 2.3919e+14 2.0490e+15 440869
## + bedrooms       1 2.0920e+14 2.0790e+15 441120
## + waterfront     1 1.7097e+14 2.1172e+15 441435
## + floors         1 1.6137e+14 2.1268e+15 441513
## + renovated      1 3.5656e+13 2.2525e+15 442506
## + sqft_lot       1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15     1 1.5906e+13 2.2723e+15 442657
## + yr_built       1 5.6435e+12 2.2826e+15 442735
## + condition      1 3.4753e+12 2.2847e+15 442751
## <none>                        2.2882e+15 442776
## 
## Step:  AIC=430962
## price ~ sqft_living
## 
##                 Df  Sum of Sq        RSS    AIC
## + good_neigh     1 2.2094e+14 9.3438e+14 427294
## + view           1 9.6282e+13 1.0590e+15 429459
## + grade          1 9.6101e+13 1.0592e+15 429462
## + waterfront     1 9.0018e+13 1.0653e+15 429561
## + yr_built       1 6.8189e+13 1.0871e+15 429912
## + bedrooms       1 3.3062e+13 1.1223e+15 430462
## + renovated      1 1.6775e+13 1.1386e+15 430711
## + sqft_living15  1 1.6529e+13 1.1388e+15 430715
## + condition      1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15     1 6.0106e+12 1.1493e+15 430874
## + sqft_lot       1 3.2768e+12 1.1520e+15 430915
## + sqft_above     1 1.1799e+12 1.1541e+15 430946
## + sqft_basement  1 1.1799e+12 1.1541e+15 430946
## + floors         1 3.1999e+11 1.1550e+15 430959
## + bathrooms      1 2.4923e+11 1.1551e+15 430960
## <none>                        1.1553e+15 430962
## 
## Step:  AIC=427294.2
## price ~ sqft_living + good_neigh
## 
##                 Df  Sum of Sq        RSS    AIC
## + waterfront     1 1.0049e+14 8.3390e+14 425329
## + view           1 9.3769e+13 8.4061e+14 425468
## + grade          1 4.8715e+13 8.8567e+14 426370
## + yr_built       1 4.1487e+13 8.9289e+14 426511
## + bedrooms       1 2.6106e+13 9.0828e+14 426806
## + renovated      1 1.4132e+13 9.2025e+14 427033
## + condition      1 1.1442e+13 9.2294e+14 427083
## + sqft_living15  1 8.4179e+12 9.2596e+14 427140
## + bathrooms      1 7.2292e+11 9.3366e+14 427283
## + floors         1 5.1669e+11 9.3386e+14 427287
## + sqft_lot       1 1.6274e+11 9.3422e+14 427293
## <none>                        9.3438e+14 427294
## + sqft_lot15     1 9.4700e+09 9.3437e+14 427296
## + sqft_above     1 1.7768e+07 9.3438e+14 427296
## + sqft_basement  1 1.7768e+07 9.3438e+14 427296
## 
## Step:  AIC=425329
## price ~ sqft_living + good_neigh + waterfront
## 
##                 Df  Sum of Sq        RSS    AIC
## + grade          1 4.6444e+13 7.8745e+14 424340
## + view           1 3.9679e+13 7.9422e+14 424488
## + yr_built       1 3.3563e+13 8.0033e+14 424621
## + bedrooms       1 1.7950e+13 8.1595e+14 424955
## + condition      1 1.0031e+13 8.2386e+14 425122
## + renovated      1 8.3178e+12 8.2558e+14 425158
## + sqft_living15  1 7.8908e+12 8.2600e+14 425167
## + floors         1 3.9438e+11 8.3350e+14 425323
## + bathrooms      1 3.7414e+11 8.3352e+14 425323
## + sqft_lot       1 2.4130e+11 8.3365e+14 425326
## + sqft_above     1 2.3809e+11 8.3366e+14 425326
## + sqft_basement  1 2.3809e+11 8.3366e+14 425326
## <none>                        8.3390e+14 425329
## + sqft_lot15     1 3.0893e+10 8.3386e+14 425330
## 
## Step:  AIC=424340.1
## price ~ sqft_living + good_neigh + waterfront + grade
## 
##                 Df  Sum of Sq        RSS    AIC
## + yr_built       1 7.8000e+13 7.0945e+14 422539
## + view           1 3.4827e+13 7.5262e+14 423560
## + condition      1 1.8697e+13 7.6875e+14 423927
## + bedrooms       1 1.0676e+13 7.7678e+14 424106
## + renovated      1 1.0488e+13 7.7696e+14 424110
## + floors         1 7.7556e+12 7.7970e+14 424171
## + bathrooms      1 4.5855e+12 7.8287e+14 424241
## + sqft_above     1 2.8014e+12 7.8465e+14 424281
## + sqft_basement  1 2.8014e+12 7.8465e+14 424281
## + sqft_living15  1 4.2468e+11 7.8703e+14 424333
## + sqft_lot       1 2.9435e+11 7.8716e+14 424336
## <none>                        7.8745e+14 424340
## + sqft_lot15     1 1.4095e+10 7.8744e+14 424342
## 
## Step:  AIC=422538.6
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built
## 
##                 Df  Sum of Sq        RSS    AIC
## + view           1 1.9437e+13 6.9001e+14 422060
## + bedrooms       1 1.0695e+13 6.9876e+14 422278
## + condition      1 2.3116e+12 7.0714e+14 422484
## + bathrooms      1 1.9274e+12 7.0752e+14 422494
## + renovated      1 1.3410e+12 7.0811e+14 422508
## + sqft_living15  1 9.3984e+11 7.0851e+14 422518
## + floors         1 3.5352e+11 7.0910e+14 422532
## + sqft_lot       1 1.4158e+11 7.0931e+14 422537
## + sqft_above     1 1.1899e+11 7.0933e+14 422538
## + sqft_basement  1 1.1899e+11 7.0933e+14 422538
## <none>                        7.0945e+14 422539
## + sqft_lot15     1 2.3719e+10 7.0943e+14 422540
## 
## Step:  AIC=422060.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view
## 
##                 Df  Sum of Sq        RSS    AIC
## + bedrooms       1 8.5389e+12 6.8148e+14 421847
## + condition      1 2.1251e+12 6.8789e+14 422009
## + bathrooms      1 1.7060e+12 6.8831e+14 422020
## + sqft_above     1 1.2459e+12 6.8877e+14 422031
## + sqft_basement  1 1.2459e+12 6.8877e+14 422031
## + renovated      1 1.0939e+12 6.8892e+14 422035
## + floors         1 5.9382e+11 6.8942e+14 422047
## + sqft_living15  1 2.5814e+11 6.8976e+14 422056
## <none>                        6.9001e+14 422060
## + sqft_lot       1 7.8955e+10 6.8994e+14 422060
## + sqft_lot15     1 5.4919e+10 6.8996e+14 422061
## 
## Step:  AIC=421847
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + bathrooms      1 3.8353e+12 6.7764e+14 421751
## + condition      1 2.7023e+12 6.7877e+14 421780
## + renovated      1 1.0256e+12 6.8045e+14 421823
## + sqft_above     1 9.6664e+11 6.8051e+14 421824
## + sqft_basement  1 9.6664e+11 6.8051e+14 421824
## + floors         1 6.3527e+11 6.8084e+14 421833
## + sqft_lot15     1 3.0405e+11 6.8117e+14 421841
## + sqft_living15  1 1.8797e+11 6.8129e+14 421844
## <none>                        6.8148e+14 421847
## + sqft_lot       1 1.7386e+08 6.8148e+14 421849
## 
## Step:  AIC=421751.4
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + condition      1 2.5458e+12 6.7509e+14 421688
## + sqft_above     1 1.2960e+12 6.7634e+14 421720
## + sqft_basement  1 1.2960e+12 6.7634e+14 421720
## + renovated      1 5.6770e+11 6.7707e+14 421739
## + sqft_living15  1 4.1997e+11 6.7722e+14 421743
## + sqft_lot15     1 1.8687e+11 6.7745e+14 421749
## + floors         1 1.3107e+11 6.7751e+14 421750
## <none>                        6.7764e+14 421751
## + sqft_lot       1 5.1651e+09 6.7764e+14 421753
## 
## Step:  AIC=421688.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_above     1 1.7795e+12 6.7332e+14 421645
## + sqft_basement  1 1.7795e+12 6.7332e+14 421645
## + renovated      1 1.0676e+12 6.7403e+14 421663
## + sqft_living15  1 4.5095e+11 6.7464e+14 421679
## + floors         1 3.2529e+11 6.7477e+14 421682
## + sqft_lot15     1 2.0826e+11 6.7489e+14 421685
## <none>                        6.7509e+14 421688
## + sqft_lot       1 4.5832e+09 6.7509e+14 421690
## 
## Step:  AIC=421644.7
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above
## 
##                 Df  Sum of Sq        RSS    AIC
## + renovated      1 9.9938e+11 6.7232e+14 421621
## + sqft_lot15     1 3.0093e+11 6.7301e+14 421639
## + sqft_living15  1 2.2173e+11 6.7309e+14 421641
## <none>                        6.7332e+14 421645
## + floors         1 1.2372e+09 6.7331e+14 421647
## + sqft_lot       1 6.7932e+08 6.7331e+14 421647
## 
## Step:  AIC=421621
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot15     1 3.1404e+11 6.7200e+14 421615
## + sqft_living15  1 2.7048e+11 6.7205e+14 421616
## <none>                        6.7232e+14 421621
## + sqft_lot       1 9.6491e+08 6.7231e+14 421623
## + floors         1 1.2630e+08 6.7232e+14 421623
## 
## Step:  AIC=421615
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated + 
##     sqft_lot15
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living15  1 3.0923e+11 6.7169e+14 421609
## + sqft_lot       1 2.9174e+11 6.7171e+14 421609
## <none>                        6.7200e+14 421615
## + floors         1 4.0939e+09 6.7200e+14 421617
## 
## Step:  AIC=421609
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated + 
##     sqft_lot15 + sqft_living15
## 
##            Df  Sum of Sq        RSS    AIC
## + sqft_lot  1 3.1649e+11 6.7138e+14 421603
## <none>                   6.7169e+14 421609
## + floors    1 8.0363e+08 6.7169e+14 421611
## 
## Step:  AIC=421602.8
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built + 
##     view + bedrooms + bathrooms + condition + sqft_above + renovated + 
##     sqft_lot15 + sqft_living15 + sqft_lot
## 
##          Df  Sum of Sq        RSS    AIC
## <none>                 6.7138e+14 421603
## + floors  1 2424591349 6.7137e+14 421605
## 
## Call:
## lm(formula = price ~ sqft_living + good_neigh + waterfront + 
##     grade + yr_built + view + bedrooms + bathrooms + condition + 
##     sqft_above + renovated + sqft_lot15 + sqft_living15 + sqft_lot, 
##     data = train2)
## 
## Coefficients:
##   (Intercept)    sqft_living     good_neigh     waterfront          grade  
##    -5.535e+05      1.603e+02      1.908e+05      6.392e+05      8.721e+04  
##      yr_built           view       bedrooms      bathrooms      condition  
##    -4.823e+04      4.781e+04     -3.451e+04      3.222e+04      2.423e+04  
##    sqft_above      renovated     sqft_lot15  sqft_living15       sqft_lot  
##     2.638e+01      4.198e+04     -3.387e-01      1.067e+01      1.643e-01
step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
## Start:  AIC=421604.8
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + view + condition + grade + sqft_above + sqft_basement + 
##     yr_built + sqft_living15 + sqft_lot15 + good_neigh + renovated
## 
## 
## Step:  AIC=421604.8
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + view + condition + grade + sqft_above + yr_built + 
##     sqft_living15 + sqft_lot15 + good_neigh + renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## - floors         1 2.4246e+09 6.7138e+14 421603
## <none>                        6.7137e+14 421605
## - sqft_lot       1 3.1811e+11 6.7169e+14 421611
## - sqft_living15  1 3.3412e+11 6.7171e+14 421611
## - sqft_lot15     1 6.6443e+11 6.7204e+14 421620
## - renovated      1 1.0684e+12 6.7244e+14 421630
## - sqft_above     1 1.1720e+12 6.7255e+14 421633
## - bathrooms      1 3.1610e+12 6.7453e+14 421684
## - condition      1 3.6071e+12 6.7498e+14 421695
## - bedrooms       1 1.0929e+13 6.8230e+14 421882
## - view           1 1.6588e+13 6.8796e+14 422025
## - yr_built       1 4.3150e+13 7.1452e+14 422680
## - waterfront     1 4.3530e+13 7.1490e+14 422689
## - sqft_living    1 4.4329e+13 7.1570e+14 422708
## - grade          1 5.3268e+13 7.2464e+14 422923
## - good_neigh     1 1.2264e+14 7.9402e+14 424504
## 
## Step:  AIC=421602.8
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + 
##     view + condition + grade + sqft_above + yr_built + sqft_living15 + 
##     sqft_lot15 + good_neigh + renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        6.7138e+14 421603
## - sqft_lot       1 3.1649e+11 6.7169e+14 421609
## - sqft_living15  1 3.3398e+11 6.7171e+14 421609
## - sqft_lot15     1 6.6766e+11 6.7204e+14 421618
## - renovated      1 1.0742e+12 6.7245e+14 421628
## - sqft_above     1 1.4933e+12 6.7287e+14 421639
## - bathrooms      1 3.4356e+12 6.7481e+14 421689
## - condition      1 3.6152e+12 6.7499e+14 421694
## - bedrooms       1 1.0953e+13 6.8233e+14 421881
## - view           1 1.6649e+13 6.8802e+14 422024
## - waterfront     1 4.3541e+13 7.1492e+14 422687
## - yr_built       1 4.4574e+13 7.1595e+14 422712
## - sqft_living    1 4.7994e+13 7.1937e+14 422795
## - grade          1 5.3784e+13 7.2516e+14 422933
## - good_neigh     1 1.2534e+14 7.9671e+14 424560
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + view + condition + grade + sqft_above + yr_built + 
##     sqft_living15 + sqft_lot15 + good_neigh + renovated, data = train2)
## 
## Coefficients:
##   (Intercept)       bedrooms      bathrooms    sqft_living       sqft_lot  
##    -5.535e+05     -3.451e+04      3.222e+04      1.603e+02      1.643e-01  
##    waterfront           view      condition          grade     sqft_above  
##     6.392e+05      4.781e+04      2.423e+04      8.721e+04      2.638e+01  
##      yr_built  sqft_living15     sqft_lot15     good_neigh      renovated  
##    -4.823e+04      1.067e+01     -3.387e-01      1.908e+05      4.198e+04

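Forward and backward selection arrive at the same set of predictors, leaving out only floors and sqft_basement, so a total of 14 columns are used in our model.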
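The two search directions can be reproduced in miniature. The following is a minimal sketch on R's built-in mtcars data rather than the housing data; the names null_fit, full_fit and search_scope and the choice of candidate predictors are illustrative assumptions, and trace = 0 only suppresses the step-by-step tables.

```r
# Illustrative only: the built-in mtcars data stands in for the housing data.
null_fit <- lm(mpg ~ 1, data = mtcars)                            # intercept only
full_fit <- lm(mpg ~ wt + hp + qsec + am + disp, data = mtcars)   # all candidates

search_scope <- list(lower = formula(null_fit), upper = formula(full_fit))

# Forward: start from the null model and add terms while AIC decreases.
fwd <- step(null_fit, scope = search_scope, direction = "forward", trace = 0)

# Backward: start from the full model and drop terms while AIC decreases.
bwd <- step(full_fit, scope = search_scope, direction = "backward", trace = 0)

# Compare the selected predictors and the AIC values that step() works with.
fwd_terms <- sort(attr(terms(fwd), "term.labels"))
bwd_terms <- sort(attr(terms(bwd), "term.labels"))
print(fwd_terms)
print(bwd_terms)
print(c(forward = extractAIC(fwd)[2], backward = extractAIC(bwd)[2]))
```

The two directions are not guaranteed to agree in general, so comparing the selected terms as done here is a cheap sanity check.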

result5 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
    waterfront + view + condition + grade + sqft_above + yr_built + 
    sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)

summary(result5)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + view + condition + grade + sqft_above + yr_built + 
##     sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1144540   -98940   -11414    73986  4394089 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.535e+05  1.665e+04 -33.235  < 2e-16 ***
## bedrooms      -3.451e+04  2.056e+03 -16.788  < 2e-16 ***
## bathrooms      3.222e+04  3.427e+03   9.402  < 2e-16 ***
## sqft_living    1.603e+02  4.563e+00  35.142  < 2e-16 ***
## sqft_lot       1.643e-01  5.758e-02   2.854  0.00433 ** 
## waterfront     6.392e+05  1.910e+04  33.471  < 2e-16 ***
## view           4.781e+04  2.310e+03  20.697  < 2e-16 ***
## condition      2.423e+04  2.512e+03   9.645  < 2e-16 ***
## grade          8.721e+04  2.344e+03  37.201  < 2e-16 ***
## sqft_above     2.638e+01  4.256e+00   6.199 5.83e-10 ***
## yr_built      -4.823e+04  1.424e+03 -33.866  < 2e-16 ***
## sqft_living15  1.067e+01  3.640e+00   2.931  0.00338 ** 
## sqft_lot15    -3.387e-01  8.171e-02  -4.145 3.42e-05 ***
## renovated      4.198e+04  7.985e+03   5.257 1.48e-07 ***
## good_neigh     1.908e+05  3.360e+03  56.790  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 197100 on 17275 degrees of freedom
## Multiple R-squared:  0.7066, Adjusted R-squared:  0.7064 
## F-statistic:  2972 on 14 and 17275 DF,  p-value: < 2.2e-16
summary(result5)$r.squared
## [1] 0.7065922
summary(result5)$adj.r.squared
## [1] 0.7063545
PRESS(result5)
## [1] 6.759204e+14
##Find SST 
anova_result<-anova(result5) 
SST<-sum(anova_result$"Sum Sq") 

##R2 pred 
Rsq_pred <- 1-PRESS(result5)/SST 
Rsq_pred
## [1] 0.7046062
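
PRESS is the sum of squared leave-one-out prediction errors. For a linear model it can be computed from the ordinary residuals and the leverages without refitting. The sketch below uses a hypothetical helper press_stat() (not necessarily how the PRESS() function used above is implemented) and the built-in mtcars data, and checks the shortcut against an explicit leave-one-out loop.

```r
# Hypothetical helper: PRESS = sum of (e_i / (1 - h_ii))^2
press_stat <- function(model) {
  e <- residuals(model)   # ordinary residuals
  h <- hatvalues(model)   # leverages (diagonal of the hat matrix)
  sum((e / (1 - h))^2)
}

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Brute-force check: refit without row i, then predict row i.
loo_errors <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  unname(mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))
})
press_loo <- sum(loo_errors^2)

# Predicted R-squared: 1 - PRESS / SST
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_pred <- 1 - press_stat(fit) / sst

print(c(press_shortcut = press_stat(fit), press_loo = press_loo, r2_pred = r2_pred))
```

Because each leverage lies between 0 and 1, PRESS is never smaller than the residual sum of squares, so predicted R-squared is never larger than the ordinary R-squared.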

However, the residual plot shows that the constant-variance assumption is not satisfied.

yhat<-result5$fitted.values 
res<-result5$residuals
Data<-data.frame(train2,yhat,res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")
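
Non-constant variance can also be checked numerically rather than only by eye. The sketch below hand-rolls a studentized Breusch-Pagan check (Koenker's version): regress the squared residuals on the predictors and compare n times the auxiliary R-squared with a chi-square distribution. It runs on simulated data whose error spread grows with x; all variable names are illustrative assumptions and this check is not part of the original analysis.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error standard deviation grows with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor.
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R^2, chi-square with df = number of predictors (here 1).
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_pval))
```

A small p-value is evidence against constant variance, which is the same conclusion a fan-shaped residual plot suggests.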

To find the optimal \(\lambda\) for a transformation of the response, we look at the Box-Cox plot. It points to \(\lambda = 0\), so we apply a log transformation to price.

boxcox(result5,lambda = seq(-1.,1,0.5))
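
The value of \(\lambda\) can also be read off numerically instead of from the plot. The sketch below is a minimal example on simulated data whose response is log-normal by construction, so the estimate should land near 0; the data frame d and the finer grid seq(-1, 1, 0.05) are assumptions for illustration.

```r
library(MASS)   # provides boxcox()

set.seed(42)
d <- data.frame(x = runif(200, 1, 10))
d$y <- exp(1 + 0.3 * d$x + rnorm(200, sd = 0.3))   # log(y) is linear in x

fit <- lm(y ~ x, data = d)

# plotit = FALSE returns the lambda grid (x) and profile log-likelihood (y).
bc <- boxcox(fit, lambda = seq(-1, 1, 0.05), plotit = FALSE)

# Lambda with the highest profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

An estimate close to 0 supports a log transformation, since the Box-Cox family reduces to log(y) at \(\lambda = 0\).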

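With the good_neigh column in place, log-transforming price improves every fit statistic: R-squared rises from 0.7066 to 0.7817, adjusted R-squared from 0.7064 to 0.7815, and predicted R-squared from 0.7046 to 0.7811. Note that these statistics are now measured on the log scale of price.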

train2 <- train2 %>% mutate(price = log(price))

result6 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
    waterfront + view + condition + grade + sqft_above + yr_built + 
    sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
summary(result6)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + view + condition + grade + sqft_above + yr_built + 
##     sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83399 -0.15020 -0.00719  0.14830  1.03942 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.108e+01  2.077e-02 533.458  < 2e-16 ***
## bedrooms      -1.112e-02  2.564e-03  -4.338 1.44e-05 ***
## bathrooms      7.205e-02  4.274e-03  16.860  < 2e-16 ***
## sqft_living    1.405e-04  5.691e-06  24.683  < 2e-16 ***
## sqft_lot       5.457e-07  7.181e-08   7.599 3.14e-14 ***
## waterfront     4.495e-01  2.382e-02  18.870  < 2e-16 ***
## view           5.621e-02  2.881e-03  19.513  < 2e-16 ***
## condition      4.954e-02  3.133e-03  15.811  < 2e-16 ***
## grade          1.416e-01  2.924e-03  48.429  < 2e-16 ***
## sqft_above     2.289e-05  5.309e-06   4.311 1.63e-05 ***
## yr_built      -5.801e-02  1.776e-03 -32.661  < 2e-16 ***
## sqft_living15  6.832e-05  4.540e-06  15.048  < 2e-16 ***
## sqft_lot15    -2.088e-08  1.019e-07  -0.205    0.838    
## renovated      6.888e-02  9.959e-03   6.916 4.79e-12 ***
## good_neigh     4.428e-01  4.190e-03 105.659  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2459 on 17275 degrees of freedom
## Multiple R-squared:  0.7817, Adjusted R-squared:  0.7815 
## F-statistic:  4419 on 14 and 17275 DF,  p-value: < 2.2e-16
test2$predict <- round(exp(predict(result6, newdata = test2)))
test_mse_ln_6 <- mean((test2$price - test2$predict)^2)
test_mse_ln_6
## [1] 35394126054
summary(result6)$r.squared
## [1] 0.781717
summary(result6)$adj.r.squared
## [1] 0.7815401
PRESS(result6)
## [1] 1047.15
##Find SST 
anova_result<-anova(result6) 
SST<-sum(anova_result$"Sum Sq") 

##R2 pred 
Rsq_pred <- 1-PRESS(result6)/SST 
Rsq_pred
## [1] 0.7811394
yhat<-result6$fitted.values 
res<-result6$residuals
Data<-data.frame(train2,yhat,res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")

However, the p-value for the sqft_lot15 predictor is high (0.838), so we drop it and refit the model. Since the refitted model satisfies the regression assumptions and gives good results, we take it as our final model.

\(y^* = 11.08 - 0.01110\,x_{\text{bedrooms}} + 0.07208\,x_{\text{bathrooms}} + 1.404 \times 10^{-4}\,x_{\text{sqft\_living}} + 5.351 \times 10^{-7}\,x_{\text{sqft\_lot}} + 0.4494\,x_{\text{waterfront}} + 0.05622\,x_{\text{view}} + 0.04952\,x_{\text{condition}} + 0.1416\,x_{\text{grade}} + 2.287 \times 10^{-5}\,x_{\text{sqft\_above}} - 0.05802\,x_{\text{yr\_built}} + 6.825 \times 10^{-5}\,x_{\text{sqft\_living15}} + 0.06885\,x_{\text{renovated}} + 0.4428\,x_{\text{good\_neigh}}\), where \(y^* = \log(y)\) and \(y\) is price.
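
Because the response is log(price), each coefficient acts multiplicatively on price: a one-unit increase in a predictor multiplies the predicted price by exp(coefficient), holding the others fixed. The sketch below converts a few coefficients of the final model into approximate percentage changes; the coefficient values are copied from the final model, and the vector names are only for illustration.

```r
# Selected coefficients of the final log-price model.
coefs <- c(waterfront = 4.494e-01,
           good_neigh = 4.428e-01,
           grade      = 1.416e-01,
           bedrooms   = -1.110e-02)

# Percentage change in predicted price for a one-unit increase.
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 1))

# For a small per-unit effect, a larger step is easier to read:
# percentage change for an extra 100 square feet of living space.
pct_per_100_sqft <- 100 * (exp(100 * 1.404e-04) - 1)
print(round(pct_per_100_sqft, 2))
```

Read this way, a waterfront home or one in a good neighborhood is predicted to sell for roughly 55 to 57 percent more than an otherwise identical home, under the model's assumptions.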

result7 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
    waterfront + view + condition + grade + sqft_above + yr_built + 
    sqft_living15 + renovated + good_neigh, data = train2)
summary(result7)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + view + condition + grade + sqft_above + yr_built + 
##     sqft_living15 + renovated + good_neigh, data = train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83481 -0.15019 -0.00713  0.14838  1.03943 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.108e+01  2.076e-02 533.668  < 2e-16 ***
## bedrooms      -1.110e-02  2.562e-03  -4.334 1.47e-05 ***
## bathrooms      7.208e-02  4.271e-03  16.879  < 2e-16 ***
## sqft_living    1.404e-04  5.688e-06  24.690  < 2e-16 ***
## sqft_lot       5.351e-07  4.964e-08  10.778  < 2e-16 ***
## waterfront     4.494e-01  2.381e-02  18.870  < 2e-16 ***
## view           5.622e-02  2.881e-03  19.514  < 2e-16 ***
## condition      4.952e-02  3.132e-03  15.811  < 2e-16 ***
## grade          1.416e-01  2.922e-03  48.465  < 2e-16 ***
## sqft_above     2.287e-05  5.308e-06   4.309 1.65e-05 ***
## yr_built      -5.802e-02  1.775e-03 -32.681  < 2e-16 ***
## sqft_living15  6.825e-05  4.528e-06  15.072  < 2e-16 ***
## renovated      6.885e-02  9.957e-03   6.914 4.87e-12 ***
## good_neigh     4.428e-01  4.181e-03 105.903  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2459 on 17276 degrees of freedom
## Multiple R-squared:  0.7817, Adjusted R-squared:  0.7816 
## F-statistic:  4759 on 13 and 17276 DF,  p-value: < 2.2e-16
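
Dropping a single predictor can also be written with update(), and the partial F-test from anova() on the two nested models reproduces the p-value of that coefficient's t-test. The following sketch uses the built-in mtcars data; fit_full and fit_reduced are illustrative names, not objects from this analysis.

```r
fit_full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Remove one term while keeping everything else in the formula.
fit_reduced <- update(fit_full, . ~ . - qsec)

# Partial F-test comparing the reduced and the full model.
nested_test <- anova(fit_reduced, fit_full)
p_from_f <- nested_test$`Pr(>F)`[2]

# t-test p-value for the same term in the full model.
p_from_t <- summary(fit_full)$coefficients["qsec", "Pr(>|t|)"]

print(c(partial_F = p_from_f, t_test = p_from_t))
```

A large p-value here means the dropped term adds little once the other predictors are in the model.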
test2$predict <- round(exp(predict(result7, newdata = test2)))
head(test2)
##      price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 5   510000        3      2.00        1680     8080      1          0    0
## 9   229500        3      1.00        1780     7470      1          0    0
## 10  323000        3      2.50        1890     6560      2          0    0
## 12  468000        2      1.00        1160     6000      1          0    0
## 17  395000        3      2.00        1890    14040      2          0    0
## 22 2000000        3      2.75        3050    44867      1          0    4
##    condition grade sqft_above sqft_basement yr_built sqft_living15 sqft_lot15
## 5          3     8       1680             0        4          1800       7503
## 9          3     7       1050           730        3          1780       8113
## 10         3     7       1890             0        5          2390       7570
## 12         4     7        860           300        2          1330       6000
## 17         3     7       1890             0        4          1890      14018
## 22         3     9       2330           720        3          4110      20336
##    good_neigh renovated predict
## 5           1         0  481307
## 9           0         0  264003
## 10          0         0  282549
## 12          1         0  409323
## 17          0         0  280257
## 22          1         0 1141010
test_mse_ln_7 <- mean((test2$price - test2$predict)^2)
test_mse_ln_7
## [1] 35383710608
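
The test MSE is in squared dollars, so its square root (RMSE) is easier to interpret. The short calculation below uses the two test MSE values reported above for result6 and result7; the vector names are only labels.

```r
# Test MSE values copied from the output above (squared dollars).
test_mse <- c(result6 = 35394126054, result7 = 35383710608)

# RMSE is on the same scale as price (dollars).
test_rmse <- sqrt(test_mse)
print(round(test_rmse))

# Relative change in test MSE after dropping sqft_lot15.
rel_change <- unname((test_mse["result7"] - test_mse["result6"]) / test_mse["result6"])
print(rel_change)
```

Both models have a test RMSE of about $188,000, and dropping sqft_lot15 lowers the test MSE very slightly.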
summary(result7)$r.squared
## [1] 0.7817165
summary(result7)$adj.r.squared
## [1] 0.7815522
PRESS(result7)
## [1] 1046.815
##Find SST 
anova_result<-anova(result7) 
SST<-sum(anova_result$"Sum Sq") 

##R2 pred 
Rsq_pred <- 1-PRESS(result7)/SST 
Rsq_pred
## [1] 0.7812095
yhat<-result7$fitted.values 
res<-result7$residuals
Data<-data.frame(train2,yhat,res)

ggplot(Data, aes(x=yhat,y=res))+
  geom_point()+
  geom_hline(yintercept=0, color="red")+
  labs(x="Fitted y",
       y="Residuals",
       title="Residual Plot")

acf(res)

qqnorm(res)
qqline(res, col="red")